Thanks for the question, bhashyakar.

Asked: May 10, 2001 - 6:55 am UTC

Last updated: July 16, 2013 - 4:38 pm UTC

Version: 8.1.6

Viewed 10K+ times! This question is

You Asked

why is a check point necessary ?what is its acutual purpose?
why is check point necesary during instance recovery when u have a sid?

and Tom said...

a checkpoint is the act of flushing modified, cached database blocks to disk. Normally, when you make a change to a block -- the modifications of that block are made to a memory copy of the block. When you commit -- the block is not written (but the REDO LOG is -- that makes it so we can "replay" your transaction in the event of a failure). Eventually, the system will checkpoint your modified blocks to disk.

there is no relationship between "checkpoint" and sid and instance recovery does not imply "checkpoint". a checkpoint reduces the amount of time it takes to perform instance recovery.

Rating

(107 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

instance recovery

A reader, August 13, 2002 - 8:03 am UTC

Hi Tom

I dont understand how instance recovery is reduced by checkpoint

For example:

database no archive log

1. I have 3 redo log groups
2. I have 10 uncomitted transactions, 3 in redo 1 , 3 in redo 2 and 4 in redo 4
3. the uncomitted data are written to datafiles
4. redo 1 is overwritten by new redo
5. instance failre
6. we start database

instance recovery starts:
rolling forward to recover data that has not been recorded in the datafiles, yet has been recorded in the online redo log, including the contents of rollback segments.

Marking all transactions system-wide that were active at the time of failure as DEAD. I dont understand how Oracle can know what transactions were active in the time of failure? I mean for example my uncomitted transactions in redo 1 has been overwritten then from where does Oracle obtain information about these 4 uncomitted transactions?

thank you!

August 13, 2002 - 5:11 pm UTC

Isn't it obvious that:

...
instance recovery starts:
rolling forward to recover data that has not been recorded in the datafiles, yet
has been recorded in the online redo log, including the contents of rollback
segments.
...

would take less time if there was 100 blocks not recorded in the datafiles versus 10000? checkpointing is the act of flushing those blocks to disk, making it so we need not recover them upon instance failure.... the less blocks we have to recover, the less time it'll take.

The information is in the rollback segments (which we just recovered) as to what was going on at the time of failure. The details really are not relevant, the fact that it does know is.

from rollback segments

A reader, August 14, 2002 - 12:40 am UTC

Hi

Does Oracle get dead transaction information from RBS segment headers? From that small transaction table in RBS header I mean...

August 14, 2002 - 7:53 am UTC

yeah

A reader, August 14, 2002 - 12:52 am UTC

...
instance recovery starts:
rolling forward to recover data that has not been recorded in the datafiles, yet

has been recorded in the online redo log, including the contents of rollback
segments.
...

Since at every log switch there is an implicit checkpoint which consequently triggers CKPT process to update the file headers and CKPT then triggers DBWR to flush dirty blocks to disk does this mean that Oracle always read the last current and active redo only therefore speeding up instance recovery? And that's why it does not matter if older redo logs have been overwritten since uncomitted transactions informations can be obtained from rollback segments transaction table and can be rollback by reading that? (no redo is needed to rollback these uncommitted transaction)

August 14, 2002 - 7:55 am UTC

No, because the checkpoint need not have completed. Say we have 10 log files. When we switch from 1 to 2 then 2 to 3 then .... then 9 to 10 -- now we want to switch from 10 to 1. The blocks protected by 1 MIGHT NOT YET BE CHECKPOINTED. You might get a message "checkpoint not complete cannot allocate new log" in the alert, the system will pause whilst the checkpoint completes.

We might have to read every redo log file upon crash recovery.

We will not reuse a redo log until the blocks it "protects" are flushed to disk and once they are we no longer need that redo (that is why it is OK for older redo logs to be reused)

Checkpoint Queue

A Reader, September 16, 2002 - 8:06 am UTC

What is Checkpoint Queue? What is the underlying mechanism for Oracle checkpointing. How internally and in how many steps it does it.?

Please give valuable advise.

September 16, 2002 - 9:28 am UTC

valuable advice = read the server concepts guide, chock full of really cool information.

checkpoint

atul, October 09, 2002 - 7:25 am UTC

Sir,

At checkpoint dirty blocks from database buffer are written to the datafiles.Now for example an uncommitted transaction
like the update statement in my example is present in the buffer ,
it will be subsequently written to the data file at checkpoint.
So now datafile contains some uncommitted data.Isn't it ??

Thanks.
atul

October 09, 2002 - 4:32 pm UTC

yes, datafiles are "fuzzy" -- pretty much all of the time

A key piece of knowledge here is that BEFORE DWBR will flush the blocks to disk -- it will signal LGWR to flush the redo.

So, if we write a block that contains UNCOMMITTED data (happens all of the time) and we "crash" right then - before the commit -- we are OK.

o upon restart we recover using the REDO
o this recovery includes recovering ROLLBACK (UNDO)
o we can then rollback the uncommitted changes from the ROLLBACK after roll forward recovery

checkpoint

atul, October 10, 2002 - 3:37 am UTC

Sir,

# When a DML is issued , there are two blocks in the database buffer

a) A modified data block that contains new value & a block of rollback segment that contain old image.
b) Also Redo ( buffer ) is generated that contains old & new image (values).

CASE 1: THERE IS NO COMMIT.
# Users are reading data from rollback blocks ( old image )
Case2: Commit happens ?? then they should read data from buffer that contains new values & rollback blocks should disappear.

How does it happen ?

Thanks.
atul

October 10, 2002 - 6:56 am UTC

If you have my book "Expert one on one Oracle" -- read the architecture, redo, rollback and transaction sections. I go into this in detail.

Short of that, read:

</code> http://docs.oracle.com/docs/cd/A87860_01/doc/server.817/a76965/c23cnsis.htm#2570 <code>

just because the user committed does not imply at all that the rollback is no longer needed. You need to understand our multi-versioning and consistent read mechanisms. that undo data might be needed for a long time after the user commits.

Few Clarifications

Santhosh, October 10, 2002 - 7:32 am UTC

Hi Tom,

Just a clarification on the answers u gave...

Quote your answer -
1. "When you commit -- the block is not
written (but the REDO LOG is -- that makes it so we can "replay" your transaction in the event of a failure)."

Question : Ok, when a user commits the data, if the data is not immediately written on to the block, then how is concurrency handled ? On commit from User A, if user B wants to query the data just saved, then how is he able to get the updated data ??

2.
"...just because the user committed does not imply at all that the rollback is no longer needed.... that undo data might be needed for a long time after the user commits."

Question : It was assumed that once a user commits the data, the rollback is cleared. But, as per ur answer u r saying that it "might be needed for a long time after the user" ? Could u explain it's purpose ?

Rgds,

Santhosh

October 11, 2002 - 7:03 pm UTC

1) from the buffer cache of course, we look there for the data before going to disk.

2) Read

</code> http://docs.oracle.com/docs/cd/B10501_01/server.920/a96524/c21cnsis.htm#17882 <code>

you'll see that rollback (undo) data is used when you SELECT data as well -- it would be used to provide the read consistent view of the data for a long running query.

not clear this part

A reader, March 03, 2003 - 10:22 am UTC

you said

"No, because the checkpoint need not have completed. Say we have 10 log files. When we switch from 1 to 2 then 2 to 3 then .... then 9 to 10 -- now we want to switch from 10 to 1. The blocks protected by 1 MIGHT NOT YET BE CHECKPOINTED. You might get a message "checkpoint not complete cannot allocate new log" in the alert, the system will pause whilst the checkpoint completes.
We might have to read every redo log file upon crash recovery.
We will not reuse a redo log until the blocks it "protects" are flushed to disk and once they are we no longer need that redo (that is why it is OK for older redo logs to be reused)"

I dont understand this fully, for example I have 4 redo log groups and an uncommitted transaction recorded in the first group if the corresponding dirty blocks are flushed to disk and the first group has been overwritten by new redo vectors what happens if instance fails? I assume the redo logs are needed to rebuild the RBS of my uncommitted transaction but is that information to rebuild the RBS available anywhere? Or Oracle simply reads ths RBS from the datafiles in disk? (No need to rebuild the RBS for this specific transaction?)

So in resume, I guess Oracle does roll forward using the redo logs but it also reads RBS (NOT rebuilt using the redo) to rollback my uncommitted transaction. Am I correct? Or all RBS are rebuilt using redo logs?

March 03, 2003 - 10:28 am UTC

... if the corresponding dirty
blocks are flushed to disk and the first group has been overwritten by new redo
vectors what happens if instance fails? ....

then when we recover we would:

a) roll forward from all changes in the redo logs. this INCLUDES rolling "rollback" forward -- we generate redo for undo

b) roll back any uncommitted changes using the UNDO information.

In fact -- even of the corresponding dirty blocks WERE NOT flushed to disk -- the recovery process would have recreated those same dirty blocks AND THEN rolled them right back again. So no matter if flushed or not, we would have "gotten them again"

thx for prompt reply

A reader, March 03, 2003 - 11:39 am UTC

Hi again

Thank you very much for prompt reply.

you said
"a) roll forward from all changes in the redo logs. this INCLUDES rolling "rollback" forward -- we generate redo for undo"

however what I meant initially is what happen if the redo to build the rollback is overwritten? I mean if we always redo to rebuld RBS then what happens if I have a uncommitted transaction if it does not commit for long timeand the redo of this transaction and the redo of RBS related to this transaction are overwritten?

March 03, 2003 - 11:43 am UTC

the rollback would have been flushed to disk just like any other data.

If there is UNDO that is protected by REDO -- the redo will not be reused until those dirty UNDO blocks are likewise flushed to disk -- just like data.

So, we'll overwrite the REDO after the UNDO is safely on disk and no longer needs it.

Commited data not "guaranteed" ?

David, March 05, 2003 - 6:53 pm UTC

Tom, I was thinking of a scenario that sometimes scares me...

**From a database perspective** -- theoretically -- when data is commited it inevitably goes to the redo log files on disk.

However, there are other layers between the database and the hardware. I mean, the commited data doesn't go "directly" to disk, because you have "intermediate" structures like i/o buffers, filesystem buffers, etc.

1) What if you have commited and the redo data has not yet "made it" to the redo log. In the middle of the way -- while this data is still in the OS cache -- the OS crashes. So, I think, Oracle is believing the commited data got to the redo logs -- but is hasn't in fact **from an OS perspective**. It just "disapeared" while in the OS cache. So redo would be unsusable. Is it a possible scenario ?

3) I imagine the so-called journaled file systems can recover from this kind of crash, but how about non-journaled file system and systems that use raw devices ?

4) Could you point me to some links or threads in this regard ?

Thanks

March 05, 2003 - 7:34 pm UTC

the data does go to disk. We (on all os's) use forced IO to ensure this. We open files for example with O_SYNC -- the os does not return "completed io" until the data is on disk.

It may not bypass the intermediate caches and such -- but -- it will get written to disk when we ask it to.

1) that'll not happen. from an os perspective, it did get to disk

3) not a problem, the data is on disk.

4) not sure -- anyone else have any pointers?

JFS

Jim, March 05, 2003 - 10:32 pm UTC

I first heard about JFS at this link
</code> http://www.linuxgazette.com/issue55/florido.html <code>

A quick search on google.com returned a stack of
sites

Regards
Jim

datafile - data and header

ygp, August 07, 2003 - 11:37 am UTC

In your book you mention that LGWR (now, may be, CKPT) writes the headers of the datafiles; whereas the data in the datafiles are written by the DBWR / DBWn.

Does not this increase to the need of locking / latching ?
(One moe point of contention ?!)

August 09, 2003 - 4:55 pm UTC

no, it is just another guy doing it -- freeing lgwr to go back to its job that much sooner.

another question....

A reader, August 11, 2003 - 1:23 pm UTC

Tom,

ver: 8.1.7.4

Is if the log file size is 512M. Log_check_point_interval is
25000. log_check_point_timeout = 0.

db_block_buffers= x = fast_start_io_target.

I am once checkpointing around 12M (along with regular checkpointing) . That is i am checkpointing around 50 times per log file with this one only. Don't you think its really too high to get "check point not complete " error message....

Further i am switching the log file once every 4-5 minutes during peak hr. so with overiding checkpoint priority might be i am not completing some of the checkpoints with higher priority ....

with dbwr = 2 and with db_block_buffers= x = fast_start_io_target are my DBWR not that active to flush the dirty blocks as quickly as possible.

I know i have to add more redo and might be the size also...
but....

q2) with checkpointing not complete will the database freeze for hours with out any trace or error messages to alert?

Thanks,
Kasturi

August 11, 2003 - 2:18 pm UTC

sorry -- most of that didn't really make clear sense.

checkpoint not complete will basically freeze the instance whilst dbwr rushes to catch up-- the message IS written to the alert log.

question

A reader, August 11, 2003 - 2:33 pm UTC

Tom,

Sorry if i was not clear enough. I was just trying to understand the below:

I was refering to Doc Id 147468.1

where

"The LOG_CHECKPOINT_INTERVAL init.ora parameter controls how often a checkpoint
operation will be performed based upon the number of operating system blocks
that have been written to the redo log. If this value is larger than the size
of the redo log, then the checkpoint will only occur when Oracle performs a
log switch from one group to another, which is preferred.

NOTE: Starting with Oracle 8.1, LOG_CHECKPOINT_INTERVAL will be interpreted
to mean that the incremental checkpoint should not lag the tail of the
log by more than log_checkpoint_interval number of redo blocks.

On most Unix systems the operating system block size is 512 bytes. This means
that setting LOG_CHECKPOINT_INTERVAL to a value of 10,000 (the default
setting), causes a checkpoint to occur after 5,120,000 (5M) bytes are written
to the redo log. If the size of your redo log is 20M, you are taking 4
checkpoints for each log.

............."

and "Starting from Oracle8i, Oracle Corporation recommends that Enterprise
Edition users who were using incremental checkpoints in an earlier release
to use fast-start checkpointing in Oracle8i. In fast-start checkpointing, the
FAST_START_IO_TARGET parameter replaces DB_FILE_MAX_DIRTY_TARGET.
FAST_START_IO_TARGET specifies the number of I/Os that should be needed during
crash or instance recovery.When you set this parameter, DBWR writes dirty buffers out
more aggressively, so that the number of blocks that must be processed during recovery
stays below the value specified in the parameter.

So in Oracle8i The incremental checkpoint position should not lag the tail of the
log by more than LOG_CHECKPOINT_INTERVAL operating system blocks.The
LOG_CHECKPOINT_INTERVAL and FAST_START_IO_TARGET is taken into account
to determine how far behind the end of the redo stream the checkpoint position can really be.

In Oracle9i FAST_START_MTTR_TARGET is the preferred method of specifying
how far the checkpoint position should be behind the tail of the redo stream. However,
LOG_CHECKPOINT_INTERVAL is still supported if needed. It functions as per the Oracle8i
behaviour above.

"
Thanks,

Kasturi

cont../

A reader, August 11, 2003 - 2:40 pm UTC

While refering to checkpoint priority:

I am refereeing to Global check points/local check points /File check points with Fast/Slow/Over ride options of check pointing.

Thanks,
Kasturi.

August 11, 2003 - 3:14 pm UTC

what I meant was -- I'm not really sure what you are asking.

A reader, August 11, 2003 - 3:33 pm UTC

Tom,

My question is if i am checkpointing as frequently as 50 times per log file along with other checkpoints(as my log files are switched once for every 4-5 minutes during peak hr), that "i am getting checkpoint not complete error"? or is it purely becuase of my redo log files which are tiny?

Thanks,

August 11, 2003 - 5:04 pm UTC

it just means that the checkpoints are *not completing*, you are not doing 50 -- they are not finishing that fast.

you either want more logfiles or larger ones

checkpoint process and datafiles

A reader, November 18, 2003 - 8:24 am UTC

Tom,
Checkpoint process updates the file headers of the data files.
What does it update the file headers of the data files with?

Thank you

November 21, 2003 - 7:48 am UTC

the fact the checkpoint completed there

checkpoint scn

Shankar, December 01, 2003 - 3:20 pm UTC

Tom, Do the start and end scn in redolog files correspond to checkpoint scn on the file headers? Is this how oracle knows which log file needs to be applied in case we need to restore a datafile from backup and perform recovery? Thanks.

December 02, 2003 - 8:10 am UTC

well, the control files are involved as well, but conceptually -- sure.

alter system checkpoint

Reader, January 24, 2004 - 11:36 am UTC

Tom, what is the use of the command ALTER SYSTEM CHECKPOINT as we have incremental checkpoints these days? I was wondering when this command would be useful? Thanks.

January 24, 2004 - 1:21 pm UTC

it is not really. you can use it to force a checkpoint and wait for it to complete.

alter system checkpoint

Reader, January 25, 2004 - 12:23 pm UTC

Tom, in what circumstances this command is useful for DBAs, say in 8i/9i version of oracle? Thanks.

January 25, 2004 - 12:38 pm UTC

see above. one comment up.

Checkpoint not complete

Ken Lau, January 26, 2004 - 5:31 am UTC

As you say, if the checkpoint for every log file switch cannot complete on time, all transactions will halt until the earlies checkpoint complete. If this situation really occur, the following problem might exist:
- the I/O is slow.
- the redo log file size is small.

To fix the problem, we may:
1. increase the no. of log group.
2. increase the size of 1 (or every) log group.

So, is there simple guildline to determine weather I should use method 1 or 2?
Or there is method 3,4,5...etc?

January 26, 2004 - 6:21 am UTC

you forget fix #3 -- fix your IO

to determine what to do, you look at your redo layout and ask yourself

o are the disks going as fast as they should. does iostat or perfmon show me going at the right/reasonable speed there.

o are the disks suffering from contention that I can alleviate

o are the programs just committing too frequently -- like every row instead of on transactional boundaries. or am I doing a load with 'commits' in the middle. I can reduce the amount of redo the system generates by committing on transactional boundaries, not row by row (in fact, i'll probably fix quite a few bugs as well doing that)

o are the programs doing things row by row instead of set based. row by row generates lots of additional redo. turning a row by row process into a set operation could cut redo generation by 50% or more

o and then lastly you can decide yourself whether you want to
a) have more groups -- meaning you just extend out the period in which dbwr has to flush the buffers

b) have larger members -- meaning dbwr might get even further behind since it has more to checkpoint but is getting told to do it less often

c) use fast_start_mttr_target (io_target in earlier releases) to have the data trickled out continously.

incremental checkpoint

Sam, April 17, 2004 - 7:44 pm UTC

My understanding is that CKPT process updates the controlfile every three seconds the current checkpoint position. Does it mean that by default incremental checkpoint occurs every 3 seconds by asking DBWR process write some number of blocks from buffer cache to data files? Thanks.

April 17, 2004 - 7:56 pm UTC

no it doesn't.

Instance Recovery

Jayesh, May 20, 2004 - 12:03 am UTC

What are the different activities taking place in instance Recovery ? The order of execution.. Which processes are involved ? What buffers they are using ?

May 20, 2004 - 10:58 am UTC

basically you need to know:

a) roll forward from the logs, read online redo's, replay anything that was not checkpointed. the same buffers are used as you use every day (the sga cache)

and then

b) roll back from the undo tablespace -- undo inflight transactions that were not committed at the time of the failure. same buffers you use every day for that.

you need to wait for a) to complete, b) can be done while people are logging in and running stuff.

Instance Recovery

Jayesh, May 20, 2004 - 12:20 pm UTC

I wanted to know which process is responsible for reading from the logfiles, which buffer it is using (DB buffer or Log Buffer) for applying to data files.

May 20, 2004 - 12:27 pm UTC

log buffer is used to buffer data to be written to the redo log files.

</code> http://docs.oracle.com/docs/cd/B10501_01/server.920/a96524/c09procs.htm#3276 <code>
SMON does the recovery. it just reads redo and recovers blocks into the buffer cache or disk if they don't fit.

(concepts guide might be something you want to read from cover to cover, very very good document, answers lots of architectural questions)

incremental checkpoint

David, May 22, 2004 - 12:41 pm UTC

(1) Assume I have created a new 9i database with default values for checkpoint related parameters like log_checkpoint_interval and the like. Then, users do concurrent transactions. My question is, does oracle do incremental checkpoint with the default settings? If so, how often does the incremental checkpoint happen? How do I find it is happening? Thanks.

(2) Does DBWR do checkpoint every three seconds by default? If so, is this called incremental checkpoint? Thanks.

You are the best.

May 22, 2004 - 5:25 pm UTC

log checkpoint interval and timeout default to 0/1800 respectively. (0 and 30 minutes)

effectively "they are not set by default", so no, the incremental checkpoings are not happening since it is doubtful 30 minutes would be the driving factor (and if it is, you have a pretty slow system anyway)

dbwr wakes up every 3 seconds and looks around, but does nothing unless it needs to.

incremental checkpoint

Reader, May 22, 2004 - 9:05 pm UTC

I was reading your answers above. Then, what is meant by incremental checkpoint and how often DBWR does it? You say DBWR wakes up every 3 seconds and at that time does it do checkpoint? Thanks.

May 23, 2004 - 9:26 am UTC

the log_checkpoint (and/or fast_start_*) parameters control dbwr's checkpointing of data. if you set them aggresively, it'll incrementally checkpoint to ensure the documented meaning of those parameters is satistifed.

without them being set, it'll not be obligated to do anything incrementally.

dbwr checkpoints only when it has to, not every 3 seconds. the 3 seconds is just a wake up and see if everything is OK sort of processing (see if anything needs be done).

Uncommited Data to Disk.

Vivek Sharma, December 20, 2004 - 6:45 am UTC

Dear Tom,

Suppose I update a record, update emp set sal=2000 where empno=123 and sal=1000;

it will have the redo and undo of the transaction as well as the redo
of the changes made to the rollback segments block.

Now you say,
"If there is UNDO that is protected by REDO -- the redo will not be reused until those dirty UNDO blocks are likewise flushed to disk -- just like data."

Suppose, I have 3 redo logs and all these redo/undo are in log file 1. During log switches these are written to disk but the transaction is not commited. Since, the transaction is not commited and both the undo and redo are to be overwritten then

1. what should be the value of sal in this emp table (on disk). I think, as per the explanation given by you, it should be 1000 since undo blocks are also flushed to disk.

2. Suppose, the user commits now, what happens ? The redo was flushed to log file which is overwritten, the dirty blocks have been flushed to disk and overwritten. We only have the undo image in the rbs which was not overwritten since the transaction was not commited.

I need some clarification on your statement and hence would request you to explain in some detail.

Thanks and Regards
Vivek Sharma

December 20, 2004 - 8:33 am UTC

Undo would never be in log file 1, redo is the ONLY thing in redo log files.

undo is stored in the rollback segments, it is NOT available to be overwritten until you commit.

how actually dbwr works?

aman, December 20, 2004 - 9:36 am UTC

sir i am reading this thread and the documentation besides.I am sorry if my question is going to sound stupid but still i have this doubt.my doubt from reading the documentation is that what happens when do not commit and just rollback?then is the DBWR going to write the whole information back from RBS or from DBC or it will not do anything as the datablocks are not affected since all the transaction happened logically uptil now?
thnx
you are the best among the bests !!

December 20, 2004 - 9:41 am UTC

DBWR doesn't get involved in commit/rollback work.

DBWR writes dirty blocks from the buffer cache to disk -- that does not happen when you "commit" -- LGWR is involved during a commit.

a bit more information

aman, December 20, 2004 - 10:41 am UTC

sir you are right!but then what happens when a rollback is issued?the changed info is there in the REDO as well as in the DBC too!the old info is there in the RBS!if rollback is there then what will happen to that data tht is there in the DBC and REDO?i know that DBWR wont take part in commit and rollback but hw blocks are written when there is a rollback?is there an overwrite of info which is already there in the datafiles as nothing is changed in them?
thanx

December 20, 2004 - 1:43 pm UTC

REDO is only used for instance and media recovery, we do not read from it to rollback. it is used after a 'crash' or after a disk fails.

when you issue rollback, we read the rollback data -- undo data -- we do not touch REDO.

do you have access to my book "Expert One on One Oracle" -- I cover this in many pages of text (pictures and all :)

RE : Log switch

A reader, February 24, 2005 - 3:55 pm UTC

Tom,

I have the following very basic question :

I have 2 log files and database and NOARCHIVELOG.
1.Suppose the process completes writing to log file #1 and switches to log file #2.
2.Now a log switch takes place which triggers a checkpoint process.
3. Since one of the occasion on which DBWR writes modified buffers to disk is during checkpoint and since DBWR signals LGWR to flush the redo to disk, this process seems to be catch22 situation. Can you please clarify me how these processes really work. I have read the pages you had mentioned in your book, but still not clear

February 24, 2005 - 5:42 pm UTC

why is it catch 22?

alt, February 25, 2005 - 5:19 am UTC

could you tell me what all are things happening in oracle architechure while inserting a record to the table
ie
1)The table having 10 records &i want to insert two more records
2)I want update a particular value

February 25, 2005 - 5:11 pm UTC

read the concepts guide

</code> http://docs.oracle.com/docs/cd/B10501_01/server.920/a96524/toc.htm <code>

if you like the way I explain things, get "Expert one on one Oracle"

it takes more than a couple of paragraphs to go through.

Redo,Undo,Checkpoint

A reader, March 03, 2005 - 5:01 pm UTC

Very Good.
But sometimes , the author does not seem to have understood
what the questioner is really asking, despite their being clear. But that's only sometimes.
BUt the responses are amazingly good in general.

"Reverse" recovery - why isn't it possible?

A reader, May 09, 2005 - 2:47 pm UTC

Tom,
Sorry if this sounds incredibly dumb - but I'll ask anyway!

Why doesn't Oracle allow a "reverse" database recovery? Is it just very difficult to implement - or does that make no sense at all? Either way, it'll be great if you could help me understand.

Wouldn't database recoveries, be that much faster with a "reverse recovery"? For example in a point in time recovery: All that RMAN would have to do is restore a level 0, apply a level 1, "figure out" that it's SCN are ahead of what is required..and that it ought to "reverse" recover 1 archivelog for consistency and voila we're done.

AS AGAINST - restoring a level 0 and applying 500 archivelogs as part of "forward" recovery.

May 09, 2005 - 4:00 pm UTC

</code> http://docs.oracle.com/docs/cd/B14117_01/server.101/b10734/rcmflash.htm#1020720 <code>

10g, flashback database.

reusing redologfiles

Puja, September 14, 2005 - 7:53 am UTC

HI Tom,

I understand that the redologfile groups cannot be reused unless the checkpoint that flushes the corresponding dirty data is completed. Along with this condition, is there any other reason why the redologfile group will not be overwritten? To explain my query, i would take this scenario:

1. i have 2 redologfile groups rg1, rg2. both are small-sized
2. i have a very long running transaction.
3. the transaction starts, and the corresponding redo records are written to rg1.
4. the rg1 group gets filled up, logswitch takes place. The corresponding dirty buffers are written to the respective datafiles.
5. The transaction starts writing redo records to the group rg2.
6. The group rg2 is filled up. log switch takes place. The checkpoint responsible for flushing out dirty data corresponding to rg1 is already completed.

My question is that:
The transaction is not yet over. Will it be allowed to overwrite rg1 with the new redo records? If yes, then what would happen if there is an instance failure now? if no, then what kind of error/problem would the database encounter?

Thanks and regards,
Puja..

September 14, 2005 - 8:55 am UTC

another reason would be that you are in archive log mode - but haven't have these files archived yet.

so, rg1 might not be able to be reused if arch hasn't archived it.

the instance would not fail, your transaction would be placed on hold until the archiver finished.

what if dataabse is in noarchivelog mode

Puja, September 14, 2005 - 9:08 am UTC

Instance recovery works well even if the database is in noarchivelog mode. That means the rules that are applicable to the db in archivelog mode are the same for that in noarchivelog mode as well. The instance recovery phase never looks for the archive log files, only the online redologfiles. The only difference would be that the ckpt process now needs to take care of archiving as well. My question remains the same:
Will it overwrite the first redolog group, despite the transaction still being in-doubt one.

September 14, 2005 - 9:32 am UTC

yes, transactions can span logs and transactions can generate more redo than you have online redo logs to hold.

I would not call this "in doubt", that is a term used in a distributed transaction with a specific meaning. It is more like "in process".

But yes, you can run a transaction that generates 1GB of redo - but only have 10MB of redo configured online. It won't be fast but you can do it :)

Thanks...

Puja, September 15, 2005 - 1:13 am UTC

It was helpful..

Thanks and Regards,

Puja

A small doubt regarding the same..

Puja, September 15, 2005 - 1:28 am UTC

Hi..
Thanks for all the replies to my query regarding Reusing online redologfiles..

I am now convinced with the fact that the first loggroup would be reused once the corresponding checkpoint is finished. Now the only doubt that remains is that in case now the instance fails, then how would the recovery of the transaction take place whose redo has been overwritten? Will it go to to Rollback segment and if yes, then from where is getting the transaction id?
Thanks and Regards,

Puja

September 15, 2005 - 7:45 am UTC

we need not recover (roll forward) the blocks that were checkpointed (they are rolled forward already)

All we need to do is apply any remaining online redo log to bring the database to the point in time right before the instance failure (this recovers UNDO as well - UNDO is protected by REDO) and then rollback any transaction that has not committed.

Undo protected by redo

Saibal Ghosh, October 26, 2005 - 6:43 am UTC

Undo is protected by redo just like any other data-- table, index etc.You have mentioned that several times in this thread as well as in your book. That's clear. My question is when the rollback segment is buffered, has the undo already been added to the segment and only the redo of the undo generated in the buffer cache or only the rollback segment is bufferd and it has undo and 'redo of the undo'--so to say appended to it in the database buffer cache?
After reading your book Expert one-on-one, my understanding is that undo information is appended to the rollback segment and then buffered. But which process does this and where in the SGA is this done? Thanks in advance.

October 26, 2005 - 12:20 pm UTC

as undo is generated - the redo for it is likewise generated.

the undo is hopefully "always buffered", the redo is generated upon the creation of it.

Not getting it

Saibal Ghosh, October 27, 2005 - 12:24 am UTC

Sorry Tom, I didn't fully get it. In your book expert one-on-one page 151, para 1, line 9 you have written:
"Undo data is added to the rollback segment, and is cached in the buffer cache just like any other piece of data would be. Like the undo generated redo may consist of many pieces of information as well."
The import of the above seems to be:
Undo information is generated and added to the rollback segment BEFORE it is buffered and naturally redo for such undo has to be generated AFTER the undo has been generated.
My question concerns specifically:
1) When exactly is the undo information generated?
2) Where(in SGA) is the undo information added to the rollback segment--I can't believe the undo information is appended when the rollback segment blocks are still on the disk!
Thanks Tom for your time and infinite patience

October 27, 2005 - 6:43 am UTC

the undo information is generated and cached *at the same time*. undo is generated into database blocks, the database blocks are buffered.

It is just like inserting a row into a table. No much different.

Undo is generated into a database block, database blocks are protected by redo.
Inserts work on a database block, database blocks are protected by redo.

A rollback segment is a lot like a database table itself.

RE: checkpoints and Mviews

Bakunian, October 28, 2005 - 12:17 pm UTC

Tom,

I have 9 redo log groups x 100 Mb in a DSS environment, oracle 9iv2 on Sun with SAN storage raid 10. So pretty fast. Database parameters as follows. My understanding was that checkpoints should occure based on fast_start_mttr_target = 30 however they are not. The problems begis at 4 AM when Mview begin COMPLETE refresh and the hell breaks loose and all checkpoints are not complete. Is there anything else I can do? BTW mviews built with NOLOGGING, PARALLEL. Database is in NOARCHIVELOG. Any input greatly appreciated.

log_checkpoint_interval = 0
log_checkpoint_timeout = 0
fast_start_io_target = 0
fast_start_mttr_target = 30

Fri Oct 28 03:40:58 2005
Thread 1 advanced to log sequence 22965
Current log# 9 seq# 22965 mem# 0: /d01/redo09A.log
Fri Oct 28 03:42:42 2005
Thread 1 advanced to log sequence 22966
Current log# 2 seq# 22966 mem# 0: /d01/redo02A.log
Fri Oct 28 03:44:16 2005
Thread 1 advanced to log sequence 22967
Current log# 3 seq# 22967 mem# 0: /d01/redo03A.log

October 28, 2005 - 1:20 pm UTC

checkpoints are initiated by log switches regardless - this is just saying "whoa...."

I don't see a checkpoint not complete tho??

(we cannot reuse a redo log until the blocks it protects are FLUSHED TO DISK)

checkpoints and Mviews

Bakunian, October 28, 2005 - 2:21 pm UTC

It starts at 4 AM when Mviews begin to refresh. I don't think making larger logs will help making DBWR lag even further. Mviews refresh lots of data therefore if any particular view has 100 Gb and it is all will be done in one transaction then no matter how many logs I have the check point will never complete. I am right? Is there way around it?

Fri Oct 28 04:00:26 2005
Thread 1 advanced to log sequence 22968
Current log# 4 seq# 22968 mem# 0: /d01/redo04A.log
Thread 1 advanced to log sequence 22969
Current log# 1 seq# 22969 mem# 0: /d01/redo01A.log
Thread 1 advanced to log sequence 22970
Current log# 5 seq# 22970 mem# 0: /d01/redo05A.log
Thread 1 advanced to log sequence 22971
Current log# 6 seq# 22971 mem# 0: /d01/redo06A.log
Thread 1 advanced to log sequence 22972
Current log# 7 seq# 22972 mem# 0: /d01/redo07A.log
Thread 1 advanced to log sequence 22973
Current log# 8 seq# 22973 mem# 0: /d01/redo08A.log
Thread 1 advanced to log sequence 22974
Current log# 9 seq# 22974 mem# 0: /d01/redo09A.log
Thread 1 advanced to log sequence 22975
Current log# 2 seq# 22975 mem# 0: /d01/redo02A.log
Thread 1 advanced to log sequence 22976
Current log# 3 seq# 22976 mem# 0: /d01/redo03A.log
Thread 1 advanced to log sequence 22977
Current log# 4 seq# 22977 mem# 0: /d01/redo04A.log
Thread 1 cannot allocate new log, sequence 22978
Checkpoint not complete
Current log# 4 seq# 22977 mem# 0: /d01/redo04A.log
Thread 1 advanced to log sequence 22978
Current log# 1 seq# 22978 mem# 0: /d01/redo01A.log
Thread 1 cannot allocate new log, sequence 22979
Checkpoint not complete
....
And so on untill 6 AM when it finishes.

October 29, 2005 - 10:40 am UTC

.... I don't think making larger logs
will help making DBWR lag even further.
...

why don't you think that?

What is happening here is at 4am you do A TON OF WORK, A BIG BURST OF ACTIVITY.

You generate gobs and gobs of redo.
You generate tons of dirty blocks.

As we flip through redo logs and get back to the "first one", all of the blocks it protects must be flushd - but dbwr hasn't had time to flush them all yet.

Adding another redo log - well, that will by definition give dbwr MORE TIME to write the blocks protected by the first redo log file (whatever log file was current when the burst of activity started).

You need enough configured online redo log to get you through this BIG BURST of sudden activity.

You have the "classic flood of activity that dbwr isn't able to keep up with". Nothing more or less.

The number of transactions - not relevant (in fact, if it did it in many little transactions - it would generate redo even faster making the problem worse, so be grateful it is a single transaction, it reduces the workload!)

RE: redo log optimal size

Bakunian, October 31, 2005 - 4:07 pm UTC

So is there a way to find optimal redo log size and/or number of redo logs for recurring problem. I am kind of unsatisfied with method of "scientific poking" therefore hoping for our input.
I looked at v$instance_recovery but dont think its that useful I wish in 9iv2 there was v$redo_size_advice or v$logstats.

November 01, 2005 - 5:17 am UTC

what is the recurring problem?

if it is the immediately above, this is the key message to take away:

....
You need enough configured online redo log to get you through this BIG BURST of
sudden activity.
........

how much redo do you generate in some unit of time?
how long (how many units of time) is your burst?

RE: Negative redo size

Bakunian, November 02, 2005 - 1:35 pm UTC

Tom - What does it mean when stas pack report shows redo size as a negative number?

Per Second Per Transaction
--------------- --------------
Redo size: -456,871.17 -74,802,270.36

Thank you.

November 03, 2005 - 6:29 am UTC

it probably means you have generated so much redo since the instance has been up that the numbers (stored in a 32bit integer) have wrapped around and gone negative.

do you see negative numbes in the v$ views themselves? was the instance up for a long time?

(either that or the numbers just rolled and went back to zero for somethings - if the next report you run looks "ok", they rolled - and were treated unsigned internally)

at shutdown time

Asim Naveed, February 01, 2006 - 4:37 am UTC

1-Does checkpoint occurs when we shutdown the database regardless of the status of redo log files.?

2- DBWR always reads from the redo log file what needs to be changed and then change (write) it in the datafile, it does never read from RAM(buffer cache), correct?

Thanks

February 01, 2006 - 8:34 am UTC

1) if it is a clean shutdown, the files are made consistent, effectively checkpointed yes.

2) dbwr doesn't read redo. dbwr manages the buffer cache. It pretty much only reads from the buffer cache.

need some clarification

shams, March 30, 2006 - 9:35 pm UTC

SQL> select checkpoint_change#,controlfile_change# from v$database;

CHECKPOINT_CHANGE# CONTROLFILE_CHANGE#
------------------ -------------------
            203454              204955

Qs.) Why the number is different ? 

            
SQL> select controlfile_sequence# from v$database;

CONTROLFILE_SEQUENCE#
---------------------
                  293

Qs.) what is controlfile_sequence# ?

March 31, 2006 - 11:42 am UTC

if they were to be the same, why would we have two columns?

from the reference guide:

CONTROLFILE_CHANGE# NUMBER Last SCN in backup control file; null if the control file is not a backup

CHECKPOINT_CHANGE# NUMBER Last SCN checkpointed

from same place:

CONTROLFILE_SEQUENCE# NUMBER Control file sequence number incremented by control file transactions

reader

A reader, June 19, 2006 - 11:03 am UTC

these below two statements are of the book "begining oracle programming"

1)A checkpoint is an event that takes place in the database THAT CAUSES database writer to flush blocks from buffer pool to disk.

2)The database writer will flush its buffer to disk periodically WHICH TRIGGERS CHECKPOINT.

Are these statements not contradicting that checkpoint causes database writer to flush blocks to disk and altoghether it says database writer flushes block that causes checkpoint......

June 19, 2006 - 6:29 pm UTC

can you give a bit more context - are those the "complete" sentences and give the sentence or two in front and back of each please (i don't carry the books with me and did not write all of beginning Oracle programming... so no electronic versions either)

reader

A reader, June 23, 2006 - 10:24 am UTC

Hi Tom

I was reading about checkpoint recently and have some doubts:

When Oracle starts its recovery process, it needs to read the log files first and determine from which point it needs to recover the data blocks. Oracle also needs to open the redo log files and

1)what is "log files" he is referring to. is it the same as redo log file....

2)why it needs to read log file ,like what it gets from there...shouldn't it read the control file.
=================================================
Each checkpoint record consists of a list of all active transactions and the address of the most recent log record for those transactions. Checkpoints are taken at specified intervals by Oracle and involve the following steps:

Flush the contents of the redo log buffers to the redo log files.

Write a checkpoint record to the redo log file.

Flush the contents of the database buffers cache to disk.

Update the data file headers and the control files after the checkpoint completes.

3)He says that checkpoint record consists of two things what is this 2nd thing "most recent log record for those transactions"...it is somewhat blurred,like what does it contains
================================================
At the same time, the control file records the checkpoint location by using a checkpoint redo byte address (RBA).

4)what is this checkpoint location?
===================================================

thanks

June 23, 2006 - 1:13 pm UTC

1) log files are redo log files, yes.

2) it reads the log files to get the redo information and checkpoint information. controlfiles have necessary information as well and are used.

3) the ADDRESS of the most recent log records. It would indicate it contains a pointer whereas to "begin".

4) the "progress" of the checkpoint, the "state" of the checkpoint, the "how far it is" of the checkpoint.

see what Roderick has to say below, this "detailed" information is not likely entire accurate either.

READER

A reader, June 23, 2006 - 2:00 pm UTC

Hi Tom,

Thanks for quicker reply.

could you please make it a bit more clear for a novice like me that what checkpoint information is there in redo log file..i know checkpoint scn is there in datafile and control file....

thanks

June 23, 2006 - 2:06 pm UTC

the checkpoint records are there. As stated.

A novice needs to know basically that:

Upon database startup, Oracle will read the redo and roll forward the database to the point right before the "crash" that necessitates this recovery.

After this roll forward process, which also catches up the UNDO information, Oracle will roll back any uncommitted transactions.

The database is now fully recovered to the last committed state prior to the crash that caused recovery.

Sometimes too much detail is just that - too much detail. We roll forward from redo, we rollback from undo - database recovered.

Hmmm

Roderick, June 23, 2006 - 2:10 pm UTC

I've heard of "checkpoint progress records" and those are stored in the control file. But I have never heard of a "checkpoint record" that maintains a list of active transactions. But I don't know what material is being quoted here.

June 23, 2006 - 2:25 pm UTC

I prefer the "concept"

roll forward
roll back
we are recovered

that is about the sum of what we need to know.

It is likely whatever material the reader was reading really meant to say "commit" perhaps.

In any case - too much information.

we roll forward
we roll back
we are recovered

we use control files, redo files and datafiles for this (of course the parameter files as well)

I agree

Roderick, June 23, 2006 - 6:48 pm UTC

For a novice, hopefully it's more than sufficient to understand that checkpointing (the verb) is how a Database Management System balances the need for high performance with the need to recover quickly from an outage.
Checkpoints (the noun) are the structures that support that activity. One can get lost trying to understand those structures and implementation in too much detail. I'm not sure when a DBA would need to know much about those structures. Just follow a good backup / recovery strategy and test it.
From a performance point of view, maybe, at most, you need to be aware that redo log size, number of log groups, and fast_start_mttr_target can influence how aggressively DBWR flushes dirty blocks to disk (as part of checkpointing - the verb). But instance tuning can be achieved without understanding how checkpoints work in much detail these days.

dirty buffers

A reader, August 16, 2006 - 6:20 am UTC

Hi

A transactions modifies 100 blocks, then commits. Are thse 100 blocks still considered dirty?

August 16, 2006 - 8:39 am UTC

there are exactly two answers to all questions:

</code> http://asktom.oracle.com/Misc/i-believe-strongly-there-are-only-two.html <code>

the answer to this is "it depends"

yes, the blocks are 'dirtied' by the modifications, but a checkpoint can happen at anytime -- flushing some of the 100 blocks out.

the likely most probable answer is "yes, there would be 100 dirty blocks in the cache, probably"

dirty buffers

A reader, August 17, 2006 - 1:28 pm UTC

Hi

May be I should have stated mreo cleary so the answer would not be depends :-P

What I mean is if I modify 100 blocks, commit, since these blocks are modified they should be consider as dirty by DBWR isnt it?

I wonder if a modified block not commited and commited are considered both dirty in cache?

Can we have dirty blocks in datafiles?

August 17, 2006 - 2:49 pm UTC

it depends.

dirty blocks are a "buffer cache only" concept. You may have blocks on disk that contain committed and uncommitted data.

If you modify 100 blocks
AND NONE OF THEM WERE WRITTEN BY DBWR by the time you commit
THEN you would have 100 dirty blocks in need of checkpointing in the buffer cache.

dirty buffer confusion

RobH, September 07, 2006 - 1:11 pm UTC

Tom, 

I have a small table:
SQL> select blocks from dba_tables where table_name  = 'CWPACTIVITY';

  BLOCKS
--------
      26

That has an index (called CWPACTIVITY1) on a column called PROCESS_ID:
SQL> select blocks from dba_segments where segment_name = 'CWPACTIVITY1';

  BLOCKS
--------
      64

It appears that the DML against the table is high(lots of inserts and deletes).

When I look into the buffer pool I see:
SQL> exec ismdba.print_table('select * from whats_in_buffer where OBJECT=''CW.CWPACTIVITY1''');

BF_POOL                       : DEFAULT
OB_TYPE                       : INDEX
OBJECT                        : CW.CWPACTIVITY1
BLOCKS                        : 7543
DATA_BLOCKS                   : 64
PCT_IN                        : 11785.9
DRTY                          : #####% -    7542
CACHE_USAGE                   :  05.1% -    59MB
-----------------

PL/SQL procedure successfully completed.


 
There are 2 interesting issues:
#1 - 7542 dirty blocks, but the index is only 64 blocks
#2 - the index never gets physically bigger on disk even though PROCESS_ID is increasing and its constantly inserting and deleting in that column (this alludes to some other posts you've written about the necessity to rebuild indexes)
#3 - This is the most confusing, a checkpoint doesn't appear to write out the dirty buffers to disk, or rather, they never appear to become clean.  The dirty buffers just stay at around 7000 dirty blocks.

Reference: View - whats_in_buffer

select a.*,
to_char(round(ratio_to_report(blocks) over(partition by bf_pool)*100,1),'00.0')||'% - ' || 
to_char(round(a.blocks*(select value from v$parameter where name = 'db_block_size')/1024/1024,0),'9900') ||'MB' cache_usage
from
(
select
        /*+ ordered */
        bp.name                        bf_pool,
        o.object_type  ob_type, 
        o.owner||'.'||o.object_name                 object,
        sum(ct)                        blocks,
        seg.blocks               data_blocks,
        round(sum(ct)/ seg.blocks *100,1) pct_in,      to_char(round(sum(bh.BUF_DIRTY)/nullif(sum(ct),0)*100,1),'00.0') ||'% - '||to_char(sum(bh.BUF_DIRTY),'999900') drty    
from
        (
        select
               set_ds,
               obj,
               count(*) ct,
               SUM(DECODE(BITAND(flag,1),0,0,1)) BUF_DIRTY, 
               SUM(DECODE(BITAND(flag,1),0,1,0)) BUF_CLEAN
        from
               x$bh
        group by
               set_ds, 
               obj
        having count(*)/5 > (
                       select max(set_count) 
                       from v$buffer_pool
                       )
        )                      bh,
        dba_objects            o,
        obj$                   ob, 
        dba_segments           seg,
        x$kcbwds               ws,
        v$buffer_pool          bp
where
        o.data_object_id = bh.obj and o.object_name = seg.segment_name and o.owner = seg.owner
and     ob.dataobj# = bh.obj
and     ob.owner# > 0
and     bh.set_ds = ws.addr
and     ws.set_id between bp.lo_setid and bp.hi_setid
and     bp.buffers != 0        --  Eliminate any pools not in use
group by
        bp.name,
        object_type,
        o.owner,o.object_name,
        seg.blocks
order by blocks desc,
        bp.name,
        o.object_name) a;

September 08, 2006 - 3:56 pm UTC

sorry, no going to comment on x$ queries - get something going against v$bh and we can talk.

Addition

RobH, September 10, 2006 - 9:27 pm UTC

I apologize.  I've removed all X$ view entries(is obj$ fine?).  The same situation still exists:

SQL> set serveroutput on size 999999
SQL> create view whats_in_buffer as
  2  select a.*,
  3  to_char(round(ratio_to_report(blocks) over()*100,1),'00.0')||'% - ' ||
  4  to_char(round(a.blocks*(select value from v$parameter where name = 'db_block_size')/1024/1024,0),'9900') ||'MB' cache_usage
  5  from
  6  (
  7  select
  8          /*+ ordered */
  9          o.object_type  ob_type,
 10          o.owner||'.'||o.object_name   object,
 11          sum(ct)                       blocks,
 12          seg.blocks                    data_blocks,
 13          round(sum(ct)/ seg.blocks *100,1) pct_in,
 14  to_char(round(sum(bh.BUF_DIRTY)/nullif(sum(ct),0)*100,1),'00.0') ||'% - '||to_char(sum(bh.BUF_DIRTY),'999900') drty
 15  from
 16          (
 17          select
 18                 objd obj,
 19                 count(*) ct,
 20                 SUM(case when dirty = 'Y' then 1 else 0 end) BUF_DIRTY,
 21                 SUM(case when dirty = 'N' then 1 else 0 end) BUF_CLEAN
 22          from
 23                 v$bh
 24          group by
 25                 objd
 26          having count(*)/5 > (
 27                         select max(set_count)
 28                         from v$buffer_pool
 29                         )
 30          )                      bh,
 31          dba_objects            o,
 32          obj$                   ob,
 33          dba_segments           seg
 34  where
 35          o.data_object_id = bh.obj and o.object_name = seg.segment_name and
 36  o.owner = seg.owner
 37  and     ob.dataobj# = bh.obj
 38  and     ob.owner# > 0
 39  group by
 40          object_type,
 41          o.owner,o.object_name,
 42          seg.blocks
 43  order by blocks desc,
 44          o.object_name) a;

View created.


SQL> exec print_table('select * from whats_in_buffer where OBJECT=''CW.CWPACTIVITY1''');
OB_TYPE                       : INDEX
OBJECT                        : CW.CWPACTIVITY1
BLOCKS                        : 7786
DATA_BLOCKS                   : 64
PCT_IN                        : 12165.6
DRTY                          : #####% -    7785
CACHE_USAGE                   :  05.1% -    61MB
-----------------

PL/SQL procedure successfully completed.

Your starting to make me feel like I'm back in high school with your response. Regardless, after speaking to some other oracle people they pointed me to some solutions and other issues (although, they were suprised by the scale of the problem).

The sys statistic of 'switch current to new buffer' was increasing rapidly (150 per second).  He also advised me to check the # of copies of the blocks in the buffer using another query:
SQL> select file#, block#, count(*) from v$bh where objd = 25313 group by file#, block# having count(*)>5;

     FILE#     BLOCK#   COUNT(*)
---------- ---------- ----------
         8       3990       7737

(It was querying the x$ view and I hopefully have corrected so not to make the same mistake twice in this post).

I just found this very odd and interesting behaviour.  It looked to me like a design problem that I'd never seen before and was hoping you had.

September 11, 2006 - 9:16 am UTC

sorry you feel like this is high school, I just don't look at the undocumented stuff, don't comment on it, won't spend time figuring out what might be wrong with your intrepetation of the x$ views and the like. Not very efficient for either of us. I'd have to dig around and see if you were bitanding things correctly, using the views in the right way.

I would say that 7,000+ copies is unusual - what release are we looking at here?

your view can be misleading - segment_name/owner is not unique in dba_segments, it could be multiplying out values (the v$bh straight query however does not have that problem - it would be interesting to see the status, dirty columns included in that however as group by columns - you know the file# and block#, just query that file#/block# and include status/dirty - before and after a checkpoint (alter system checkpoint) if you can)

so I'd drop the view - at least remove the dba_segments from it, simplify it.

More info

RobH, September 11, 2006 - 12:29 pm UTC

I respect your comments about x$.  Here is some additional info (including version, which may play a part as well):

The application inserts and deletes a lot into this small table and this index is modified a lot (yet, it never gets phyiscally larger).

Connected to:
Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.5.0 - Production

SQL> set lin 140
SQL> select file#, block#, status, dirty, count(*) from v$bh where objd = 25313 group by file#, block#, status, dirty;

     FILE#     BLOCK# STATU D   COUNT(*)
---------- ---------- ----- - ----------
         8       3983 xcur  N          1
         8       3990 cr    N          1
         8       3990 cr    Y       7783
         8       3990 xcur  Y          1

SQL> alter system checkpoint;

System altered.

Immediately:
     FILE#     BLOCK# STATU D   COUNT(*)
---------- ---------- ----- - ----------
         8       3983 xcur  N          1
         8       3990 cr    N          1
         8       3990 cr    Y       7021
         8       3990 xcur  Y          1

After a few seconds:


     FILE#     BLOCK# STATU D   COUNT(*)
---------- ---------- ----- - ----------
         8       3983 xcur  N          1
         8       3990 cr    N          1
         8       3990 cr    Y       5737
         8       3990 xcur  Y          1

After a approx a minute:

     FILE#     BLOCK# STATU D   COUNT(*)
---------- ---------- ----- - ----------
         8       3983 xcur  N          1
         8       3990 cr    Y       2849
         8       3990 xcur  Y          1

But then...
     FILE#     BLOCK# STATU D   COUNT(*)
---------- ---------- ----- - ----------
         8       3983 xcur  N          1
         8       3990 cr    Y       3195
         8       3990 xcur  Y          1

and eventually:
     FILE#     BLOCK# STATU D   COUNT(*)
---------- ---------- ----- - ----------
         8       3983 xcur  N          1
         8       3990 cr    Y       7311
         8       3990 xcur  Y          1

It appears that the dirty blocks are mostly flushed to disk, but by the time they've all been flushed many more dirty blocks show up again, and eventually an equilibrium occurs around 7000 dirty blocks.

Again, thanks.

September 11, 2006 - 1:43 pm UTC

but the cr blocks don't need to goto disk. There can truly be "one" dirty block at a time only (we make changes to current mode blocks - we all change the same "last block")

Those are all cr (consistent read) blocks - they are not really "dirty", not in the sense that they need to be flushed to disk.

so, ignore the "dirtyness".

But having 7,000 plus copies of that block (cr versions of it) would be unusual. There shouldn't be more than a handful (it would be rare for double digit numbers of copies)

For that I'll refer you to support, I cannot reproduce but what I'd probably need to do to reproduce would be to setup a large number of concurrent reads and writes against this very small table. the large number of CR block clones in there could lead to cache buffers chains latch contention (as they all hash to the same list by data block address and it would take a while to search through them all)

lots of CR blocks

a reader, September 11, 2006 - 5:04 pm UTC

The bit about the small table that undergoes a lot of DML resonates. I have worked with an application that uses a single number generator table instead of Oracle sequences. Instead of a bunch of hits to seq$, you get a bunch of hits on this number generator table. Gets/execution for a single row access ( a single row is analagous to a single sequence ) should have been very low ( < 10 I suppose ) but we saw 6,000 gpe during relatively slow times, up to 250k gpe when the demand for blocks was high. The additional LIO seems to have been due to read consistency activity - the "data blocks .. undo applied" statistic had very large numbers.

In this application the problem was fixed in by forcing one row per block in the number generator table - we'll see what happens if they ever start up multiple thread per row ( per sequence, if you will). I didn't know about v$bh at the time I was looking at this problem, I wonder if I v$bh would have shown very high CR numbers during peak load times before the fix was applied. And perhaps the buffer cache would have contained an unusually large number of undo blocks being used for read consistency(?).

Just speculating...

September 11, 2006 - 6:13 pm UTC

oh, that definitely has something to do with it - but.... the number of CR blocks in the buffer cache here is high, might be "nothing to worry about", but could cause cache buffers chains contention....

Thanks

RobH, September 12, 2006 - 12:01 pm UTC

Yes, the last post sounds very similar. I didn't build the application, the design sounds similar, but I think they are using a sequence, however they are inserting and deleting into this temporary table (there are only ever a few thousand rows in the table) and that the sequence or id can be repeated many times.

I was only brought on because of performance issues and I noticed that there was a lot of IO problems and was investigating what was consuming so much of the buffer cache (even pushing it out). This wasn't "necessarily" the problem, but was an oddity I noticed during the investigation.

I still find the fact that this index is not growing incredible, even though it is constantly changing(rather inserting/deleting) and indexing a column that is a sequence (even if it is used multiple times).

-

Ik, October 25, 2006 - 7:24 am UTC

Tom,

A question on how checkpointing is done when the datafile is taken offline.

This is the scenario

1. Insert/Update data into a table which uses datafile01
2. take datafile01 offline
3. commit
4. alter system --- log switch
5. log switches and old log group status is "active"
6. alter system checkpoint
7. old log group status is "inactive"

My question is, in this scenario how can checkpoint complete when the datafile to which data needs to be written into is offline?

Thanks,

October 25, 2006 - 9:41 am UTC

the act of taking the datafile offline checkpointed that datafile already. We have no need for redo for that datafile anymore.

Michel Cadot, October 26, 2006 - 7:15 am UTC

Now the question is what happens if you roll back in step 3 instead of commit? :)

Michel

October 26, 2006 - 12:05 pm UTC

the undo information for that transaction is saved (system rollback segments if possible) so that when the file is onlined again, we can apply it.

Follow up

Ik, November 02, 2006 - 6:49 am UTC

Tom,

Apologies for following up so late on the question. This is a followup to my review question on how checkpointing is done in the given scenario of taking datafile offline (October 25, 2006).

Oracle documentation says that taking a datafile offline does not perform a checkpoint.

</code> http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b10759/statements_1004.htm#i2150231 <code>

"OFFLINE

Specify OFFLINE to take the datafile offline. If the database is open, you must perform media recovery on the datafile before bringing it back online, because a checkpoint is not performed on the datafile before it is taken offline."

Iam confused. Please, could you clear this one for me?

Thanks,

November 02, 2006 - 7:16 am UTC

indeed. well, the blocks do not need to be checkpointed - since you'll be doing media recovery to that file, it'll reconstruct everything from the redo information.

Ik, November 02, 2006 - 9:03 am UTC

Yes. and so that redo log information is absolutely essential, right?

So, unless the datafile is brought back online and media recovery happens, how can 'alter system checkpoint' succeed and how can the redo log group be made 'INACTIVE'. By making it 'INACTIVE', the redo group becomes a candidate for reuse. It cannot be reused as it still protects information needed to bring back the datafile online.

Wouldnt there be a data loss if the redo group gets reused and the instance goes down. Next time when the datafile is brought back online, will it rely on the archive logs to bring it back online?

Thanks,

November 02, 2006 - 9:10 am UTC

why wouldn't the checkpoint work? it checkpoints to available files.

you have to be in archivelog mode to offline it
you will have to do MEDIA recovery (apply archives)

Thank You

Ik, November 02, 2006 - 9:17 am UTC

Writing to Redo and not to Datafile

Paulo, February 22, 2007 - 9:39 am UTC

Tom, this is probably a silly question, but could you please explain why writing to the redo log files is faster than writing to the datafiles? I never really got that... Why dont we just write to the datafiles when we commit our work?

Thanks a lot.

February 22, 2007 - 10:38 am UTC

datafiles are dealt with in blocks - typically say 8k.

So, the smallest write would be 8k

And blocks in datafiles are all over the place - lots of random IO.

redo logs on the other hand:

o changed bytes only, small
o sequentially written to, not seek/write/seek/write/seek/write - just "write write write write"

Question

anupam pandey, February 25, 2009 - 7:19 am UTC

Hi Tom,
Is there any system view from where can i know whether any nologging operation was performed after last checkpoint.

Thanks And Regards,
Anupam Pandey

February 25, 2009 - 5:44 pm UTC

checkpointing happens almost continuously these days, and even so, it was never a "single point in time" operation (it might take a long time for a checkpoint to complete - it might take a long time for a non-logged operation to start/finish as well, they are not 'digital' events)

So, it wouldn't really be useful - however, you can query v$datafile for
http://docs.oracle.com/docs/cd/B19306_01/server.102/b14237/dynviews_1076.htm#REFRN30050
unrecoverable_change#

A reader, January 24, 2010 - 4:55 pm UTC

Hi sir;

If I take the datafile offline, I will need recovery until last_change# of the datafile.
The reason behind this, checkpoint is not issued.
I know the task of checkpoint (write dirty blocks to disk,etc) but what is the relation between
checkpoint and recovery?
Since dirty blocks remain in buffer cache, they can be written when I bring up the datafile.
or
Suppose there is no dirty blocks of that datafile and I take it offline.
Again why do I need recovery?

January 26, 2010 - 1:50 am UTC

if you take it offline normal, you will not.

taking the file offline would checkpoint it. If you are in noarchivelog mode, it is the ONLY way to offline the file (without dropping the file)

checkpoint

MemonSalim, January 29, 2010 - 5:18 am UTC

Hi sir,
every committed transaction has system change number and it is recorded into control file.and record byte address is also recorded into control file.then what is difference between record byte address and system changed number?

January 29, 2010 - 4:19 pm UTC

... every committed transaction has system change number and it is recorded
into control file...

not, it is not recorded in the control file.

difference RBA AND SCN

MemonSalim, February 01, 2010 - 11:25 pm UTC

hi sir,
i know what is redo byte address and system change number but please give me difference between RBA and SCN.and what is use of RBA and SCN................

February 02, 2010 - 12:23 pm UTC

if you know what they are... why would you need to know the difference between them?

the scn is a clock, a ticker, it increments up every time a transaction completes. It is like a clock.

the RBA is a pointer into the redo stream, it is like a pointer, an arrow, a postition in a file. It is in fact a file identifier, block within file, offset within block - a pointer to a position within a redo log file. Here is a good description of it
http://www.ixora.com.au/notes/rba.htm

The scn is just used like time would be. Instead of saying "recover this database till 09:15 on 1-jan-2010", you could say "recover until scn 2141415332", it is a fine grained clock, a ticker.

The RBA is a pointer into a file.

difference RBA AND SCN

MemonSalim, February 03, 2010 - 12:38 am UTC

hi sir,
Thank you very much for good description.it's solved my difficulty.

sir , i have doubt that is RBA is stored in control file?

i am waiting for your reply.
bye sir. have a nice day.

February 03, 2010 - 10:06 am UTC

it is more of a datafile thing, not a control file thing.

difference RBA AND SCN

MemonSalim, February 03, 2010 - 11:30 pm UTC

hi sir,
sorry sir,for asking same question but this question makes me more eager to know .

as explain,the RBA is a pointer into the redo stream, it is like a pointer, an arrow, a postition in a file.
but how to oracle know which pointer is useful to points to redo log.
so my question is that, is RBA information recorded into control file?

and at last thank you sir again for giving me reply

bye sir

February 04, 2010 - 12:32 pm UTC

yes, it is. You could see it in control file dumps (no, I'm not dumping a control file - I don't do that).

cursor

MemonSalim, February 04, 2010 - 11:19 pm UTC

hi sir,
Thank you sir, i understood very well about R.B.A.
but now my question is that whenever user fire statement at that time what is done first.

1)weather memory is allocated for cursor or
2)oracle checks statement in shared pool.

bye sir.

February 07, 2010 - 11:51 pm UTC

you gotta have a data structure to look for before you start looking, there would be some amount of UGA memory allocated to start the process of building the cursor - before you look for it in the shared pool.

drop instance

ahmed adel, February 08, 2010 - 5:34 pm UTC

hi ,tom
how can i recover my instance if i drop my main instance(orcl) with oradim -delete
if i recreate it ever thing will work good
thanks

February 15, 2010 - 11:04 am UTC

you would follow the same exact steps you would have followed to restore your database to a new machine from backups - without having to copy the files from backup to disk of course.

So, just do what you would do to restore. (yes in short, use oradim to reregister the instance)

automatic segment-space management

MemonSalim, February 20, 2010 - 6:02 am UTC

hi sir,
please give me answer of the following question.

For a tablespace created with automatic segment-space management, where is free space managed?

A. in the extent

B. in the control file

C. in the data dictionary

D. in the undo tablespace

bye sir.

February 25, 2010 - 3:29 am UTC

a quiz? isn't the goal of a quiz to test your knowledge, knowledge that YOU HAVE BEEN TAUGHT, knowledge that you are already supposed to have?

use logic here, what can you rule out, what is left, what does the documentation have to say?

asm

ahmed, February 26, 2010 - 3:27 am UTC

hi,tom
how can i register my asm that i have created with oradim
in my enterprise manger(em)
(shortly when i open my em i don't found my asm instance that i have created with oradim)
although i have my mounted diskgroup

thanks

asm

ahmed, February 26, 2010 - 3:29 am UTC

automatic segment-space management

MemonSalim, February 28, 2010 - 11:38 pm UTC

hi sir,
According to me answer should be 1) in the extent. but please tell me that am i right or not? bye sir

March 01, 2010 - 12:02 pm UTC

explain your work, cite references, point to documentation, explain your thoughts.

And we'll glad comment on your work.

A quiz is to test YOUR knowledge, show us your work and we'll tell you if you got the research right.

Maybe start by explaining why the other three could not/would not be the place for it...

John Carew, April 21, 2010 - 6:10 pm UTC

Hi sir;

What is the difference between issuing:

alter system checkpoint;
and
flush buffer cache;

Suppose, my intention is to flush entire buffer cache.
Is there any difference between issuing above two statements?

If I am taking benchmark and want to clear buffer_cache, which one of them is preferred?

Many thanks

April 21, 2010 - 6:25 pm UTC

one command does a checkpoint which writes out dirty blocks, this does not empty the cache.

the other one empties the buffer cache.

If you are bench marking, probably the last thing you want to do is flush the cache, that is so misguided.

Under what real world situation would you realistically have an empty buffer cache?

And I hope you are using RAW or ASM or forcedirectio mount options - if you aren't, you better search for the term:

"secondary sga"

on this site and make sure you understand how most conventional file systems work (making your flush buffer cache even less "real" - harmfully less "real")

A reader, April 28, 2010 - 10:05 pm UTC

....one command does a checkpoint which writes out dirty blocks, this does not empty the cache

If the dirty blocks are written, there shouldnt anything else inside the buffer cache?
Am I wrong? little bit confused tho.

Suppose I am benchmarking two diffrent sql queries.In this case, I may need to flush buffer cache once I
benchmark first query.
Am I wrong?

April 29, 2010 - 7:32 am UTC

because we put a block out to disk - does that mean we need to MOVE it out to disk or COPY it out to disk. Why would the act of writing the block to disk have to remove it from the cache????

Are you wrong? Yes, I think it is very wrong to flush the buffer cache in order to "benchmark". Tell me - in real life - when would your cache ever be devoid of blocks? Never - never ever. flushing the cache is more "un-real life" than any other approach.

Also, search this site for 'secondary sga' to see why you would almost certainly just be fooling yourself. You would have to REBOOT YOUR SERVER if you wanted to flush the cache (because you would need to flush the OS file system cache too)

A reader, April 29, 2010 - 9:51 pm UTC

Hi sir;

so, checkpoint writes out the dirty blocks to disk but dirty blocks are not cleared in the cache, they still exist.
am I right?
If that is the case, how does oracle understand wheter the dirty block is already written to disk or not?

BTW, i never considered os file system cache, thanks sir

April 30, 2010 - 6:59 am UTC

there is a list of blocks (a list of pointers to blocks) that have been dirtied and are in need of checkpointing.

When they are checkpointed - they are taken off of this list
When they are subsequently modified again - they are put back on

A reader, May 17, 2010 - 4:39 pm UTC

Thanks sir.

I just want to confirm that;

1-)Suppose I performed a "select statament" and block doesnt exist in buffer cache, oracle will retrieve this block
from disk to buffer cache. But the block will be still clean? Am I right?

2-)The block will only be dirty, if we perform: insert,update,delete on this block am I right?

3-)so, does it mean that after checkpoint dirty blocks will not become clean automatically?

May 24, 2010 - 9:12 am UTC

1) it depends, there is delayed block clean out - your read might dirty the block to clean it.

2) see #1

3) how did you come to that conclusion?

Database crashes while huge insertion process is going on

Suhail Dev, June 25, 2010 - 5:31 am UTC

What would happen if my database goes down[say due to electricity failure or some other issue] while a huge database insertion/updation [say 100M] is going on.
My autocommit is off and half of the data is inserted/updated before database goes down. What would be the state of my table after my database is restarted.

Thanks

July 06, 2010 - 10:03 am UTC

if the database is terminated for any reason while a transaction is running (and hence has NOT yet committed), then instance crash recovery will roll the transaction forward during recovery and then upon discovering "it did not commit", roll it back.

The insert will disappear, it would be the same as if you ctl-c'ed the insert while it was running, it will roll back.

A reader, July 03, 2010 - 10:45 pm UTC

I know that, once the checkpoint is issued,oracle will write dirty blocks to disk.

Once the drity blocks are written to disk, do they become clean? or are they still dirty? I mean is there a background process which cleans them once they are written

July 06, 2010 - 2:43 pm UTC

they are only dirty in the cache. dirty blocks is a term used to refer to blocks in the cache that need be flushed to disk. Once they are flushed, they are no longer dirty until they are modified again.

Checkpoint and Undo

Arindam, September 12, 2010 - 1:08 am UTC

Hi Tom,

I was going through the threads and found some interesting queries in my mind.

1) Checkpointing is done once log switch happens, after every 3secs or if DB cache fills (DBWR will ask LGWR to flush the info to log file and then in turn checkpoints).
Now, (i) If it happens every 3 secs, can it be in sync with the timings of LGWR (since LGWR also does it after every 3 secs, will then checkpoint)? (ii) I read that checkpoint keeps the record of last SCN occurred in the database, how does it find it? (iii)Why CKPT process updates datafile header and control file with last SCN and RBA, why cannot LGWR do it?

2) You said Undo blocks are flushed to undo segment (tablespace) in disk like data whenever log switch happens. Then (i) Is undo buffer inside DB buffer cache? (ii) Is the same LRU list maintained for undo like data? (iii) If concerned log contents of a uncommitted transaction is overwritten during log switch then does the user process finds the undo data in disk? If yes then how can it succeeds?

September 13, 2010 - 2:19 pm UTC

1) checkpointing it not done every 3 seconds.

It is safe however to say that these days, checkpointing is almost continuously happening. log switches do it, cache filling up would do it, fast_start_mttr_target does it, among other reasons.

where do you see 3 seconds documented for checkpointing?

2) I said undo blocks are flushed to disk just like data is. (stop). Add nothing more after that. That is a true statement.

Undo blocks look pretty much like data blocks - in fact, you could create rollback segments in a 'regular old tablespace' if you want (we used to have to do that in the past).

Checkpointing

Arindam, September 13, 2010 - 11:32 pm UTC

Hi Tom,

So you are saying that Checkpointing is happening all the time due to several causes, so the datafile header is updated with latest SCN and control files are updated with latest SCN and RBA all the time. And the recovery will happen on the basis of the SCN and RBA.

1) Is it correct?
2) Now why checkpoint number which is supposed to grow by 1 with a small update, increases by huge numbers? I came to know many other processes increases checkpoint number (or I can say SCN). So are the processes those which you had mentioned?
3) Can I calculate how much my SCN or checkpoint number will increase?

September 14, 2010 - 7:07 am UTC

1) pretty much, the database is constantly keeping things up to date.

2) the SCN (do not call it a checkpoint number, the SCN is the SCN, a checkpoint change# is an observation of the SCN. Calling the SCN a checkpoint number would be like calling TIME on a clock "lunch" - sometimes TIME on a clock represents lunch, but most of the time it does not.

The SCN is incremented every time a commit or rollback occurs. Every time a transaction completes. Also, when doing distributed stuff - two systems will synchronize their clocks (SCN's) for recovery periods. Just consider the SCN a monotonically increasing value - nothing more, nothing less.

3) depends, usually it'll increment by 1 for every transaction but other things (like distributed) can increase it as well.

Checkpointing

Arindam, September 14, 2010 - 2:51 pm UTC

Hi Tom,

Thanks. You said that "The SCN is incremented every time a commit or rollback occurs, every time a transaction completes."
Now my queries are:
1) I thought that whenever rollback happens the SCN stays as it is. Then how can it be changed during rollback?
2) If at SCN 12000 suppose I update a row and in the mean time someone else perform another update and commits it. Then when I commit or rollback, what will be the value of the SCN in the Database at both the cases?
3) If my database is not distributed and is my personal database then can I calculate the SCN? What factors should be counted then?
4) The column checkpoint_change# in v$database will show the latest SCN. Isn't right?

Thanks,

September 15, 2010 - 7:47 am UTC

1) the scn is a marker for the end of a transaction. we don't care about the outcome of the transaction - just that it is DONE. rolling back ends a transaction just like commit does.

2) the SCN is independent of the transaction ordering. It goes up whenever a transaction ends - we don't care when the transaction began, only when it ends.

If your SCN was 12,000 and you committed two transactions and nothing else committed/rolled back in that period of time, you would expect the SCN to be about 12,002.

3) you cannot, the SCN is constantly moving because the database is constantly working, constantly doing transactions itself.

It would be useless to be able to compute this anyway - there is NOTHING you could do with it.

4) no, it will not. it is the checkpoint change#, a POINT IN TIME. Just like you take lunch at 12 noon - your lunch change# would be 12 noon, but it might be 1:30pm right now - time keeps on moving, your lunch time does not, it is an observed point in time.

a checkpoint change# is an observed point in time.
an SCN is the clock, the clock never stops.

ops$tkyte%ORA11GR2> select dbms_flashback.get_system_change_number from dual;

GET_SYSTEM_CHANGE_NUMBER
------------------------
                 3748052

ops$tkyte%ORA11GR2> select dbms_flashback.get_system_change_number from dual;

GET_SYSTEM_CHANGE_NUMBER
------------------------
                 3748053

the scn is JUST A TIME, a ticker, constantly moving forward.

Checkpointing and SCN

Arindam, September 15, 2010 - 8:22 pm UTC

Hi Tom,

That is quite an interesting thing.Thanks for the response. So whenever we do a commit or rollback SCN is generated. Now 1) Do the LGWR writes the corresponding change vectors in online log files for both commit and rollback as I thought SCN will be generated whenever LGWR writes in online log files?
2) We would not get an SCN if we complete a transaction but still haven't committed or rolled back. Isn't it correct?
3) Checkpoint will be always happening which will try to keep the datafiles and controlfiles updated with the SCN at that point in time.Isn't it correct? Now I can say that during database shutdown the CKPT will update datafile header and control file with latest SCN.

September 16, 2010 - 6:41 am UTC

You really want to read the concepts guide OR if you like the way I say it - Expert Oracle Database Architecture. I've a feeling that 99.99% of your questions that will follow this are asked and answered in those books........ It'll save a bunch of time.

So whenever we do a
commit or rollback SCN is generated.

no, time advances forward, the SCN is incremented, advanced. Not "generated", it already exists - it always exists. It is like time.

Now 1) Do the LGWR writes the
corresponding change vectors in online log files for both commit and rollback
as I thought SCN will be generated whenever LGWR writes in online log files?

lgwr is CONSTANTLY writing committed and uncommitted changes to the redo log - in response to commits or just because it feels like it (it does not need to get a commit to write to the online redo logs, lgwr is more or less constantly flushing the redo log buffer of anything that is in it).

lgwr doesn't advance the scn every time it writes, the scn advances in response to a transaction finishing.

2) We would not get an SCN if we complete a transaction but still haven't
committed or rolled back. Isn't it correct?

that doesn't make sense. Tell me, how do you "complete a transaction" yet not commit nor rollback? In order to complete a transaction you issue commit or rollback.

3) Checkpoint will be always happening which will try to keep the datafiles and
controlfiles updated with the SCN at that point in time.Isn't it correct? Now I
can say that during database shutdown the CKPT will update datafile header and
control file with latest SCN.

No, it won't be updated with the current SCN at that point in time, it will be updated with the SCN that was in place as of the point in time the blocks were modified. Say in the buffer cache you have blocks that were modified at:

10am (scn= 1000)
10:30am (scn=1500)
11:00am (scn=2000)

and the first two blocks are checkpointed at 11:15am (scn=2500). The checkpoint change# would be 1500, the last modified block would be as of 2000 and the current SCN for the last completed transaction would be 2500 - they are TIMES

Checkpointing

Arindam, September 18, 2010 - 1:04 pm UTC

Hi Tom,

Thanks. I tried to get some idea by testing it in my personal database. I found that SCN is gradually increasing every time and checkpointing only makes the entry of SCN recorded that particular time. Now
1) what is making this SCN to be readily incremented? is it with time mainly?
2) I thought it is only incremented by transaction complete events(commit/rollback). But if it is only increases by time then how can I it be differentiated by whether the transaction is completed or not?
3) Also I have seen in some docs that CKPT process updates the controlfile and datafile header at every 3 secs. Is it true? (as you told earlier that checkpointing happens by various means and CKPT process to me, updates controlfile and datafile header all the time about checkpoint_change# made last time)

September 20, 2010 - 1:52 pm UTC

1) the database is never inactive when it is running. smon, pmon, all of the backgrounds are constantly doing things (things being transactions). You are just seeing the backgrounds themselves committing.

2) you are seeing completed transactions - OUR completed transactions. It is not based on time moving forward, it is based on commits/rollbacks taking place.

3) it may update the control files but it doesn't need to update the datafiles every three seconds. controlfiles will see more or less continuous activity.

and on a normal system with activity - not your standalone database doing nothing - checkpoints are naturally advancing on a regular basis in response to your fast_start_mttr_target, other init.ora parameters, redo log activity, buffer cache filling up and so on.

you are overanalyzing this. What you need to understand is:

a) a checkpoint of a block will take place at some time and when it does, we no longer need the online redo log that protects that block for instance recovery (we only need it for media recovery)

b) the SCN is like a clock, constantly advancing in response to "transaction completed"

Checkpointing

Arindam, September 21, 2010 - 1:54 pm UTC

Hi Tom,

This is pretty much clear now. But one last question.
If CKPT updates the controlfile at every 3 secs then it shouldn't mean that checkpointing also being done at every 3 secs. I mean CKPT only updates the controlfile about the last checkpoint_change# recorded at every 3 secs. Isn't it right?

September 21, 2010 - 4:12 pm UTC

.. If CKPT updates the controlfile at every 3 secs then it shouldn't mean that
checkpointing also being done at every 3 secs. ...

why would it? the control files contain a lot of information.

but why would the act of recording something "trigger" that something? You have the cause and effect backwards.

Checkpointing

A reader, September 21, 2010 - 11:01 pm UTC

Hi Tom,

My checkpoint_change# at any point of time is suppose 12000. We know that if log buffer fills up then it may flush out some uncommitted change to online log group. So that time we have uncommitted log in the online log. Now, if we switch the log group manually or if it switches automatically then we would expect a checkpoint trigger which asks DBWn to write all dirty blocks (protected by that online log) which also includes some uncommitted data to disk.
So if that time SCN is 12030, would the CKPT process write that checkpoint_change#:12030 on datafile header since it contains uncommitted changes as well?

As per your comment:
"a checkpoint of a block will take place at some time and when it does, we no longer need the online redo log that protects that block for instance recovery (we only need it for media recovery)".
So we would not need any instance recovery till 12030. Thus can we expect the uncommited change is permanent now?

I hope I have made you aware correctly about the situation.

September 23, 2010 - 9:52 am UTC

the datafiles almost always contain uncommitted data - uncommitted data that is protected by UNDO so it can be rolled back. I think you are getting too wrapped up in minutiae here.

SCN = clock, like a wall clock. always advancing. It is time.

checkpoint change# = a point in time, like "I had lunch at noon".

At a point in time, the database is in a certain state - with some set of transactions finished and some in flight. It knows that "at noon" what set of transactions were finished and which were in flight. If it CRASHED - it would put the database back to the way it was at noon and then begin to rollback the uncommitted (in flight) transactions. This is what you and I need to understand.

You would need instance recovery till 12030++ - because we do not checkpoint in a "single event", it takes a long time (computer time wise it does) - all of the files are "as of different points in time", by the time one file is up to 12030 - TIME HAS KEPT ON GOING - the files are always lagging behind reality - transactions do not stop simply because we are checkpointing - time keeps on going.

Checkpointing

A reader, September 23, 2010 - 12:38 pm UTC

Thanks Tom, Nice answer.

so I would expect if instance doesn't fail/crash but if I rollback that uncommitted change, then the checkpoint_change# 12030 will not decrease.

September 27, 2010 - 9:46 am UTC

correct, just like your lunch would still have been at 12 noon, even if you get sick later on...

checkpointing

A reader, September 28, 2010 - 10:17 pm UTC

Hi Tom,

Suppose I have checkpoint_change# at any point of time T1 is 12000.So that means all the datafile header is updated with SCN 12000 at T1. Now, I am making an update in the DB buffer cache for a block which is a part of Datafile D1 and commits it at T2. Should the block hold the then SCN (say 12030) at T2? If yes then why?

Now another transaction does some other work for block of datafile D2 at T3 (>T2), commits it. If the above scenario is true then it should contain SCN say 12050.

Now if the DB cache requires some space at T4 while SCN is 13000, then it will ask the LGWR to flush out some logs which in turn makes the DBWn to flush out concerned blocks protected by those logs. If those blocks which are flushed also include block of D1(with SCN 12030) and block of D2(with SCN 12050) datafiles, then when checkpoint will happen what will be the ultimate checkpoint_change# for all the datafiles? Will all the block header of all the datafile be incremented to SCN 13000 at T4?

September 29, 2010 - 8:39 am UTC

blocks don't hold SCN's truly. A block has a transaction history and from the transaction history we can determine the "time" of the block.

But yes, the block that contained committed data as of T2 would have an associated SCN of t2 - why - because it was modified then, what other time could it have???

You cannot answer your last question, nor do you *need* to. You are going way too far into this - it is not necessary for any of us to go here.

The ultimate checkpoint change numbers would reflect whatever was just checkpointed. As I've said over and over - it is just a TIME. It is a point time that the files have all of the changes up to. If the checkpoint wrote out everything up to time 12048 - that is what it would be.

checkpoint

A reader, October 02, 2010 - 1:40 am UTC

Hi Tom,

1) Does "alter system checkpoint" signal DBWn to flush data block images for only committed transaction?

2) Why do we need "object checkpointing"?

3) LOG_CHECKPOINT_TIMEOUT: oracle will check what is written that time (say 60 sec) ago in log file and advance checkpoint automatically so that the time gap between checkpoint position and content of log file progress always becomes that time.
LOG_CHECKPOINT_INTERVAL: oracle will check what is written that much records ago (say 1000) ago in log file and advance checkpoint automatically so that the gap between checkpoint position and content of log file progress always becomes that much record.

Am I correct?

October 04, 2010 - 1:42 am UTC

1) no, checkpointing is the activity of getting blocks on to disk from the buffer cache that have changes that are protected by redo.

It does not matter if the blocks having these changes have only committed changes or not. The only thing that matters is

a) the blocks are in the cache
b) the blocks have changes on them that are protected by redo
c) the changes made to the block are not reflected on disk in the datafiles yet.

2) sometimes we want all blocks on disk for a given segment. This typically happens with parallel query. Parallel query doesn't want to use the buffer cache in most cases (horribly inefficient to have to look for a million blocks in the cache - more efficient to read them from disk). So, before parallel query goes - it will issue a segment checkpoint to make sure the most current version of the block is on disk and the cache can be bypassed. Other operations like 'truncate' use this as well.

3) http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:344617462392

object checkpoint

A reader, October 04, 2010 - 1:42 pm UTC

Hi Tom,

Can you please explain how parallel processing can bypass the cache? what type of command invokes object checkpoint or is it invoked internally?
Also why truncate or drop command uses object checkpoint?

October 05, 2010 - 5:51 am UTC

well, if we make sure the current version of the block we need is on disk by checkpointing, we don't need to look in the cache (any later versions of the block wouldn't be used by us anyway - read consistency and all). so, checkpoint the segment - get the copy we need on disk - and blow off the cache which would just slow us down...

the command is internal.

truncate and drop need to get the blocks on disk because they do not generate undo - so if they need to be undone (database crash before they finish for example, or we discover that we cannot drop the table for some reason), everything we need must be on disk so we can "undo the drop/truncate bits we've done so far".

A reader, October 05, 2010 - 9:32 am UTC

<quote>
..everything we need must be on disk so we can "undo the drop/truncate bits we've done so far".
</quote>

You mean data dictionary changes right?

Also does truncate and drop generate redo?

Thanks

October 05, 2010 - 12:24 pm UTC

no, I mean the table data itself.

If the database crashed 99.9% of the way through the drop table - we must be able to restore the entire table back the way it was before the drop. We checkpoint it to disk so we can do that - regardless.

The dictionary changes are protected by redo and undo, the dropped table segment IS NOT.

drop and truncate generate redo - FOR THE DICTIONARY changes...

A reader, October 05, 2010 - 1:38 pm UTC

<drop and truncate generate redo - FOR THE DICTIONARY changes... >

and also redo for undo in case of failure to revert back the changes related to dictionary

so what I mean from your statement below

<everything we need must be on disk so we can "undo the drop/truncate bits we've done so far". >

is we are undo the change that made so far related to data dictionary if truncate/drop fail

Thanks

October 06, 2010 - 4:33 am UTC

... and also redo for undo in case of failure to revert back the changes related to
dictionary ...

that is what I said - that would fall into the area of 'dictionary changes'.

... is we are undo the change that made so far related to data dictionary if
truncate/drop fail
...

also would seem to go without saying - it would be a necessary step to take to "undo the drop/truncate bits we've done so far".

instance recovery and checkpoint dependencies

A reader, January 26, 2011 - 4:03 am UTC

Hi Tom,

I have couple of questions regarding checkpoint:

1) Does instance recovery is performed to fill up the gap between the checkpoint and SCN in redo log at any point of time?
2) Does redo log and control file holds almost the same SCN or different?
3) For committed transactions, the dirty blocks in db buffer are written in Datafiles one after the other whenever there is any space requirement in DB buffer cache. Do every writing by DBWn during this case causes CKPT to write the SCN in datafiles and controlfiles?

February 01, 2011 - 2:31 pm UTC

1) instance recovery will use checkpoint information to discover what redo entries need to be applied to database blocks, yes.

2) the redo log is a continuum - it is a stream of information. The control file is a "state", an observation. The control file/datafile headers tell you where in the redo stream you need to read.

3) as blocks are checkpointed, we update information in the control files/data file headers, yes.

Checkpointing

A reader, February 02, 2011 - 1:05 pm UTC

Hi Tom,

Can you please elaborate in #2 above how datafile can tell which stream of redo log we need to read? (I can understand the controlfile part though that it stores RBA and redo log sequence).
Also for #3 above will checkpoint happen when uncommitted data blocks (dirty) needing space which are flushed to disk by DBWR? Is SCN at that time is also written in datafile header and control file?

February 02, 2011 - 1:30 pm UTC

datafile headers have this information too, the checkpoint scn information. We can recover a database without the control files (we can rebuild them and then apply as much redo as we want, the datafiles know what they need)

the scn the datafiles are as of are recorded in their headers, yes. v$datafile is built on this information.

Redo Log put on hold till DBWn completes..

Zvi, July 29, 2011 - 3:01 pm UTC

Hi Tom,

This is quoting from your book "Expert Oracle Database Architecture":

"The database buffer cache is where database blocks are stored temporarily. This is a structure in Oracle’s SGA. As blocks are read, they are stored in this cache, hopefully so we won’t have to physically reread them later. The buffer cache is first and foremost a performance-tuning device. It exists solely to make the very slow process of physical I/O appear to be much faster than it is. When we modify a block by updating a row on it, these modifications are done in memory to the blocks in the buffer cache. Enough information to redo this modification is stored in the redo log buffer, another SGA data structure. When we COMMIT our modifications, making them permanent, Oracle does not go to all of the blocks we modified in the SGA and write them to disk. Rather, it just writes the contents of the redo log buffer out to the online redo logs. As long as that modified block is in the buffer cache and not on disk, we need the contents of that online redo log in case the database fails. If, at the instant after we committed, the power was turned off, the database buffer cache would be wiped out. If this happens, the only record of our change is in that redo log file. Upon restart of the database, Oracle will actually replay our transaction, modifying the block again in the same way we did and committing it for us. So, as long as that modified block is cached and not written to disk, we can’t reuse that redo log file."

All seems perfectly logical to me except for the last sentence. Why would Oracle wait for DBWn to complete its task? All that is necessary for recovery exists already in redo log. Isn't that enough? (Or is releasing cache blocks so holy?..)

August 01, 2011 - 11:46 am UTC

we are talking about reusing a redo log.

So, as long as that modified block is cached and not
written to disk, we can’t reuse that redo log file.

before we can reuse that log file, dbwr has to have written all blocks protected by redo in that log file to disk.

wouldn't a redo log entry be enough?

Zvi, August 04, 2011 - 11:59 am UTC

Hi Tom,

In your response you wrote:

"So, as long as that modified block is cached and not
written to disk, we can’t reuse that redo log file."

My question is: why?

Isn't the mere existence of a redo log entry ensures a data block change (whether or not this block been written to disk)? Because if an instance terminates abnormally before Oracle writes block to disk, recovery will still be possible due to a redo log entry, wouldn't it?

August 04, 2011 - 6:46 pm UTC

If we reuse the redo log - we would not have the redo log entry anymore.

So, the existence of the redo would be "not existent anymore".

That is the entire point - we have to keep that redo UNTIL the block it protects is on disk.

If we were to overwrite it - to reuse that redo log file - before we didn't need it anymore - we WOULD NOT be able to recover it with redo (because the redo doesn't EXIST)

Zvi, August 05, 2011 - 3:13 pm UTC

Tom,

Thanks for making our livelihoods a lot more easier. You wouldn't even start to imagine the huge help you've been giving to us all.

You wrote:

"If we reuse the redo log - we would not have the redo log entry anymore.

So, the existence of the redo would be "not existent anymore".

That is the entire point - we have to keep that redo UNTIL the block it protects is on disk.

If we were to overwrite it - to reuse that redo log file - before we didn't need it anymore - we WOULD NOT be able to recover it with redo (because the redo doesn't EXIST)"

Question:

When you say "redo log reused" would that mean that it was first archived? (as it should be..)

August 06, 2011 - 7:54 am UTC

we have two more online redo logs always. assume we have 2.

Assume, we are writing to redo log 1 right now. The redo in this log file protects some blocks in the SGA. It is needed to recover the changes to those blocks if the instance failed right here, right now.

Assume we just filled redo log 1 and moved into redo log 2. In the background - dbwr will be writing out to disk the blocks that are protected by the redo in redo log file 1.

Now, assume redo log 2 has filled and we need to advance back into redo log 1. IF there are some blocks in the buffer cache that are still protected by the redo in redo log 1 - we cannot yet reuse that redo log file. We pause, we let dbwr finish and then start using it.

Archival didn't come into play. During instance recovery we only require online redo log files. We NEVER use archives during instance/crash recovery. We only use archives during media recovery (recovery from backups).

Also, the redo log may or may not be archived by the time we try to advance into it (addressing your 'as it should be...' comment). It may or may not be archived.

First - we might not be in archivelog mode - we cannot count on it even being archived.

Second - arch might not have finished archiving it. If we are in archivelog mode and arch didn't finish - the database would pause and wait for arch as well.

cyclical redo logs

Zvi, August 06, 2011 - 3:54 pm UTC

Hi Tom,

Thanks for your swift and detailed response.

I'm ashamed to admit it, but the cyclical manner of redo logging has totally slipped my mind..

Your help is precious.

Many thanks,

Zvi

Checkpoint

harsimran, January 24, 2012 - 12:17 pm UTC

Hello Tom need to know one thng like if we have 3 redo log group rg1 , rg2, rg3. If after 3 seconds when lgwr moves the change vector into rg1, when it full then it will switch to rg2 after that it will switch to rg3. when it switch from rg1 to rg2, check point is occured. when checkpoint occurs data will move from buffer cache to datafiles.
Now my question is if the changes data is uncommitted, when instance fails then data that moved to datafiles already during checkpoint will be comitted automatically or will be lost when instance fails.

January 24, 2012 - 1:50 pm UTC

Now my question is if the changes data is uncommitted, when instance fails then
data that moved to datafiles already during checkpoint will be comitted
automatically or will be lost when instance fails.

has nothing to do with a checkpoint really - any data that was not committed will be rolled back during the crash recovery for that instance. It will not be committed.

checkpoint

MANNOJ S V, February 15, 2012 - 10:33 pm UTC

Hi tom,
You said,
When you commit -- the block is not
written (but the REDO LOG)

If i have 10 REDO LOG file and all REDO LOG file are in active.
My Question is,
1) How it will update the changes to disk. 10 Redo Log will have different SCN. (Since checkpoint is ACTIVE in 10 Redo log file)

2) Is a single checkpoint process that reads all the REDO content or Multiple process (ckpt)

February 16, 2012 - 7:14 am UTC

1) I don't really know what you mean.

Redo is written to in a circular manner - so redo log 1 would be written to, then 2 then 3 and so on. They tell a story - in order - of what happened to the database. They are capable of replaying that story (redoing that story).

2) a checkpoint doesn't really read redo, a checkpoint flushes buffers from the SGA to the datafiles.

Checkpoint

MANNOJ S V, February 16, 2012 - 8:38 am UTC

Hi tom,

You said,
a checkpoint doesn't really read redo, a checkpoint flushes buffers from the SGA to the datafiles.

My Question,
1) When we give alter system checkpoint.
Redolog file status change from Active to Inactive.

What it mean.

2)Then how checkpoint knows the latest SCN .

February 16, 2012 - 10:27 am UTC

1) that means we no longer need that redo log file for crash recovery - every block that redo log file was protecting is safely on disk. We can reuse that redo log file.

2) the same way the "database" knows. But, yes, the checkpoint process will gather information about a redo log file - but it doesn't really read the entire thing to process it or anything. It just needs a tiny bit of basic information. The thrust of a checkpoint is not reading redo logs - it is writing dirty blocks.

Checkpoint v/s commit

V V, March 07, 2013 - 12:23 pm UTC

Does a commit ALWAYS trigger a Checkpoint ?
If no, then for the same data, won't we have a different SCN# in the redolog and the datafile headers ?

If yes, then there is no need for the un-committed checkpoints to the datafile as it is going to be rolledback anyways. Is it for the performance sake that the periodic checkpoint of the un-committed data happens ?

Can you please clarify this doubt if you don't mind ?

Thanks

V V

March 07, 2013 - 1:04 pm UTC

A commit does not trigger a checkpoint, no.

We write to the redo log when you commit, the redo log is a sequential record of what happened. Using the redo in order - we can replay your transactions that have not been written to the datafiles.

We buffer the datafiles - knowing that if we crash - we can reconstruct them from the sequentially written to redo log file.

It is for data integrity that periodic checkpoints happen. Before we can overwrite redo in an online log file, we need to make sure all blocks protected by that redo are safely on disk.

uncommitted transactions are written during checkpoints as well, a rollback is a logical operation - not a physical operation. A rollback changes a block just like an insert does. If you insert 100 rows and rollback - we actually go out and delete 100 rows to get rid of them. If you update 100 rows and rollback - we go out and update the same 100 rows to put them back. That generates undo and redo and modifies blocks. they are all checkpointed at some time.

It is also a performance thing - checkpoints are a bunch of single block, random writes (opposed to redo, big sequential writes) - and happen in the background, not in a user process.

A reader, March 29, 2013 - 10:05 am UTC

whether fire checkpoint when execute drop user statements

A reader, July 15, 2013 - 10:37 am UTC

hi,teacher tom:
when execute drop user statesment,whether it fires checkpoint?
how can i prove it?

July 16, 2013 - 4:38 pm UTC

it wouldn't fire a general checkpoint, it'll fire segment level ones for every segment that is dropped (we have to get them out of the buffer cache onto disk before we can drop them)

dropping a user just runs a bunch of SQL. it wouldn't fire a checkpoint.

Status of modified data block.

Mahfuz, October 02, 2013 - 1:56 pm UTC

Hi tom,

First i thank you for your effort to oracle community.I found your website is most helpful and i learn a lot from here.

My question is 'if a user update some rows of a table ,checkpoint initiates before user commits.what happen to the modified blocks when checkpoint started.'

Thanks

Live SQL

Share and learn SQL and PL/SQL; free access to the latest version of Oracle Database!

Dev Gym

Classes, workouts and quizzes on Oracle Database technologies. Expertise through exercise!

checkpoint

Breadcrumb

Question and Answer

You Asked

and Tom said...

Rating

Comments

instance recovery

from rollback segments

Checkpoint Queue

checkpoint

checkpoint

Few Clarifications

not clear this part

thx for prompt reply

Commited data not "guaranteed" ?

JFS

datafile - data and header

another question....

question

cont../

checkpoint process and datafiles

checkpoint scn

alter system checkpoint

alter system checkpoint

Checkpoint not complete

incremental checkpoint

Instance Recovery

Instance Recovery

incremental checkpoint

incremental checkpoint

Uncommited Data to Disk.

how actually dbwr works?

a bit more information

RE : Log switch

Redo,Undo,Checkpoint

"Reverse" recovery - why isn't it possible?

reusing redologfiles

what if dataabse is in noarchivelog mode

Thanks...

A small doubt regarding the same..

Undo protected by redo

Not getting it

RE: checkpoints and Mviews

checkpoints and Mviews

RE: redo log optimal size

RE: Negative redo size

at shutdown time

need some clarification

reader

reader

READER

Hmmm

I agree

dirty buffers

dirty buffers

dirty buffer confusion

Addition

More info

lots of CR blocks

Thanks

-

Follow up

Thank You

Writing to Redo and not to Datafile

Question

checkpoint

difference RBA AND SCN

difference RBA AND SCN

difference RBA AND SCN

cursor

drop instance

automatic segment-space management

asm

asm

automatic segment-space management

Database crashes while huge insertion process is going on

Checkpoint and Undo

Checkpointing

Checkpointing