A reader, April 30, 2008 - 1:18 pm UTC
Thanks Tom for the reply.
Crash/Instance recovery=rolling forward+ rolling backward.
If rolling backward occurs after the database opens, why does the alert log signal that crash recovery is complete before signalling
Completed: alter database open?
I mean, nothing is recorded in the alert log for the rollback phase.
Why is that so?
April 30, 2008 - 3:23 pm UTC
crash recovery = rolling forward.
rolling back just sort of happens afterwards.
rolling forward = we are consistent. but we have some transactions that did not commit, they will never commit, so we will roll them back.
we can do stuff while they are rolling back, rolling back is sort of a "normal thing"
only rolling forward is "special"
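If you want to watch that "normal thing" happening after the database opens, you can query the fast-start rollback views. A sketch only - column names are as documented for 10g/11g, verify against your release:

```sql
-- Transactions SMON is rolling back after a crash-recovered open,
-- and how far along each one is:
select usn, state, undoblocksdone, undoblockstotal
  from v$fast_start_transactions;

-- How aggressively parallel rollback is configured (LOW/HIGH/FALSE):
show parameter fast_start_parallel_rollback
```

While rows show up here the database is already open and usable - which is exactly why the alert log says nothing special about the rollback phase.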
instance recovery
rizwan, August 19, 2011 - 3:26 pm UTC
1) What redo does an insert generate?
The block id into which the row is inserted + the values of the columns + the rowid of the new row?
If so, during the roll-forward phase does Oracle try to get the same block id from the datafile that was in the buffer cache (or may not have been) when the system crashed, or does it just get any available block for the insert?
2) From where does it start reading the online redo log files after a system crash (is it after the controlfile SCN)?
August 30, 2011 - 1:13 pm UTC
1) redo is always a 'change vector'. We record a blob of data that describes what bytes to change on a block in order to 'redo' that operation.
It is a binary operation - it will always be redone to the same exact block every time it is applied.
2) it starts reading as far back as it needs to in order to perform a crash recovery. This will depend on how far the latest checkpoint got.
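You can see that starting point yourself by comparing the controlfile's checkpoint SCN with the SCN range each online log group covers. A sketch, using views that have existed for many releases:

```sql
-- The checkpoint SCN recorded in the controlfile:
select checkpoint_change# from v$database;

-- The SCN range each online log group covers; crash recovery begins
-- reading in the group whose range contains that checkpoint SCN:
select group#, sequence#, first_change#, status
  from v$log;
```

The group marked CURRENT (and possibly ACTIVE ones before it) is what crash recovery will need.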
'fast-start on-demand rollback'/'deferred transaction recovery'
A reader, December 18, 2011 - 9:30 pm UTC
I just noticed the terminology 'fast-start on-demand rollback/deferred transaction recovery'.
It is all about the rollback phase of crash recovery. I think it is pretty much the same as what you mentioned: '(we'll roll back blocks on demand in this mode)'.
As I understand it, it makes Oracle available to users quicker, rather than waiting for the 'rollback' phase to finish.
So my only confusion here is:
1> if I have an active transaction and try to switch the current UNDO tablespace with the commands below, I will hit an error:
create undo tablespace new ...
alter system set undo_tablespace=new;
alter tablespace old offline;
ORA-30042: Cannot offline the undo tablespace
2> in the normal case, after 'shutdown immediate' and 'startup', there should not be any 'UNEXPIRED' or 'ACTIVE' UNDO segments anymore,
so I can easily switch the undo tablespace as in 1>.
3> if 'fast-start on-demand rollback' is always enabled by default, I am thinking that after 'shutdown abort' and 'startup', the steps in 1>
should hit an error since the old UNDO is still necessary, but my testing shows it can successfully switch UNDO. Why?
December 19, 2011 - 7:31 am UTC
you can switch the undo tablespace, you just wouldn't be able to get rid of it until we are done with it. We'll continue to use the 'old' one until we are done with it.
if you shutdown abort a 'boring' database, the crash recovery will be so fast - it will appear as if we didn't need the undo. Run a couple dozen really big transactions and do not commit them, then abort. Then you'll see some activity on the "old" undo tablespace.
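One way to see this - a sketch only, where big_table, some_column, UNDO_NEW and UNDO_OLD are hypothetical names for illustration:

```sql
-- 1) generate lots of uncommitted undo (do NOT commit):
update big_table set some_column = some_column;

-- 2) from another session: shutdown abort; startup

-- 3) switch the undo tablespace - this succeeds:
create undo tablespace undo_new datafile size 500m;
alter system set undo_tablespace = undo_new;

-- 4) but the old undo segments show they are still needed:
select segment_name, status
  from dba_rollback_segs
 where tablespace_name = 'UNDO_OLD';
-- statuses like 'PARTLY AVAILABLE' or 'NEEDS RECOVERY' mean SMON
-- is not done with them yet; the old tablespace can only be
-- dropped once that clears.
```

That is the distinction Tom is making: switching is allowed immediately, getting rid of the old tablespace is not.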
Hi Tom
Lokanath, May 02, 2012 - 12:41 am UTC
Hi Tom,
1. To do crash recovery, the database needs the data blocks that need recovery to be brought into the SGA. May I know which process brings them in?
Thanks
May 02, 2012 - 1:23 pm UTC
smon will read them and apply redo to them during the recovery phase.
Crash Recovery
Chinmaya, April 23, 2013 - 12:06 pm UTC
Hi Tom,
The Oracle documentation mentions "Archived logs are never used during crash recovery". My doubt is: if a server has crashed and has not been up for, let's say, 24 hours, and in those 24 hours production is started from a backup server, can I collect the redo/archive logs from this backup server and apply them on the production server after it is up and Oracle performs crash recovery?
April 23, 2013 - 1:19 pm UTC
crash recovery is a crashed instance. It is not a crashed set of disks.
Ok, it sounds like you are saying:
a) machine1 failed, it is not available. it was running database1, the online redo logs and everything are inaccessible.
b) we started machine2, we put a copy of database1 on this machine - this copy is now database2.
c) we are running on machine2 with database2
d) machine1 becomes available - can we apply redo from machine2 to machine1 to catch it up.
if so, the answer is NO, database1 and database2 are different databases.
If on the other hand you are running Data Guard and you did a SWITCHOVER from machine1 to machine2 (the online redo logs, presumably on shared storage, were accessible so we could switch over and not lose any work) - then yes, we can switch back.
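For reference, the pre-12c SQL*Plus switchover sequence for a physical standby looks roughly like this (a sketch; with the Data Guard broker you would just issue DGMGRL's SWITCHOVER command instead):

```sql
-- On the current primary (machine1):
alter database commit to switchover to physical standby;
shutdown immediate;
startup mount;

-- On the standby (machine2), now becoming the primary:
alter database commit to switchover to primary;
alter database open;
```

The point is that both sites are the SAME database with one redo stream, which is why switching back and forth works - unlike the cloned-copy scenario below.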
Crash Recovery
Chinmaya, April 23, 2013 - 1:46 pm UTC
Here's a better example:
1. We have a Oracle Database (P3B) on Server 1.
2. The Database is cloned every day, and mounted on a backup server (Server 2) for backup and also for failover if needed.
3. Now, lets say, the entire Server 1 just crashes.
4. We leave the server 1 like that, and start production from Server 2 as we have Cloned copy of the Database (P3B)
5. Production on Server 2 continues for 4 hours and has a new set of Redo/Archive Logs
6. After 4 hours, we gracefully shut down the Server2, and Boot the Server1.
7. Since, this will be the first boot after the server 1 crashed, Oracle will perform Automatic Crash recovery.
8. Once the crash recovery is done, can I apply the archive logs from Server 2 here to bring it to the point where we shut down Server 2 gracefully?
Thanks a lot for your help.
April 23, 2013 - 2:59 pm UTC
what is "cloned"?
but anyway, since the clone is as of some point in time in the PAST (the clone was taken at (2), we are at time (3) now). The database that just crashed is at time (3), the clone is at time (2).
you have just lost everything that happened since (2) when you started the clone (I hope you realized that), and when you opened the clone and started transacting against it - it became its own database with its own redo threads. Those redo threads cannot be applied to the crashed database.
however, if you use actual disaster recovery features like data guard and you run in maximum protection mode, we can failover and failback like you desire. But it takes using disaster recovery software specifically designed to do this.
your scenario is more or less a mirror image of what I described above.
Chinmaya, April 23, 2013 - 2:14 pm UTC
Or, one more scenario would be,
The devices on which the Database resides is also presented to another server (server2).
After Server1 crashes, we start the production from Server2.
Production on Server 2 continues for 4 hours and has a new set of Redo/Archive Logs.
After 4 hours, we gracefully shut down the Server2, and Boot the Server1.
Since, this will be the first boot after the server 1 crashed, Oracle will perform Automatic Crash recovery.
Once the crash recovery is done, can I apply the archive/redo logs from Server 2 here to bring it to the point where we shut down Server 2 gracefully?
And that's my question: can we apply archive/redo logs to the database immediately after a crash recovery?
I am extremely sorry if this is spamming your page.
April 23, 2013 - 3:05 pm UTC
if the devices are present on both machines, and you just start up the SAME database on another machine, there will be no need to apply anything. The database is already caught up.
You would just shutdown on machine2 and startup on machine1 - nothing special needs to be done.
Instance crashing
rizwan, November 17, 2013 - 7:33 pm UTC
Hello Tom,
We are getting the error below while taking an incremental level 0 backup every week.
Details are as below :
Version details :
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
System name: SunOS
Release: 5.10
Version: Generic_147147-26
Machine: sun4u
Error details :
error log :
----------
Sat Nov 09 04:23:08 2013
Errors in file /oracle/PEN/saptrace/diag/rdbms/pen/PEN/trace/PEN_lgwr_11067.trc (incident=252501):
ORA-00239: timeout waiting for control file enqueue: held by 'inst 1, osid 11069' for more than 900 seconds
Incident details in: /oracle/PEN/saptrace/diag/rdbms/pen/PEN/incident/incdir_252501/PEN_lgwr_11067_i252501.trc
opidrv aborting process LGWR ospid (11067) as a result of ORA-603
Sat Nov 09 04:23:09 2013
Sweep [inc][252501]: completed
Sweep [inc2][252501]: completed
Sat Nov 09 04:23:11 2013
PMON (ospid: 11024): terminating the instance due to error 470
Sat Nov 09 04:23:12 2013
System state dump requested by (instance=1, osid=11024 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/PEN/saptrace/diag/rdbms/pen/PEN/trace/PEN_diag_11051.trc
Instance terminated by PMON, pid = 11024
PEN_diag_11051.trc:
-------------------
Process Group: DEFAULT, pseudo proc: 0x6d265f920
O/S info: user: penadm, term: UNKNOWN, ospid: 11069
OSD pid info: Unix process pid: 11069, image: oracle@encsappenci (CKPT)
I found two Metalink notes for which I already took action, but no luck. Can you please help?
**********************************************************
1.Database Hung With ORA-00239: Timeout Waiting For Control File Enqueue (Doc ID 1389274.1)
Symptoms
===========
+ Alert log reported "ORA-00239: timeout waiting for control file enqueue".
+ Database was hung with no redo log file switches until the database was bounced.
Cause
=======
This is caused by the following internal unpublished bug:
BUG 10623249 - HIGH "LATCH FREE" WAIT EVENT WHEN RESOURCE MANAGER IS USED
+ The process waiting for the Control File enqueue was blocked by Process 'inst 1, osid 27319'.
+ This blocker was waiting for 'latch free' for at least '15 min 29 sec'. This is the base blocker.
+ The blocker's top call stack functions are: kslges kgskglt
+ The needed latch is a Resource Manager latch.
Solution
============
Please upgrade to 11.2.0.3 in which the relevant bug is fixed.
- Already running on 11.2.0.3
or
Disable the resource manager on top of 11.1.0.7 as follows:
$ sqlplus / as sysdba
alter system set RESOURCE_MANAGER_PLAN='';
- Already disabled as a first action step.
###############################################################################################
2.Disk I/O Contention/Slow Can Lead to ORA-239 and Instance Crash (Doc ID 1068799.1)
Symptoms
=========
The following error occurred and the instance was terminated.
ERROR
-----------------------
Tue Feb 23 04:12:23 2010
Thread 1 cannot allocate new log, sequence 15947
Checkpoint not complete
Current log# 1 seq# 15946 mem# 0: /oradata/CNOPRPT0/redo01.log
Errors in file /oracle/diag/rdbms/cnoprpt01/CNOPRPT01/trace/CNOPRPT01_lgwr_2307.trc (incident=432582):
ORA-00239: timeout waiting for control file enqueue: held by 'inst 1, osid 2309' for more than 900 seconds
Incident details in: /oracle/diag/rdbms/cnoprpt01/CNOPRPT01/incident/incdir_432582/CNOPRPT01_lgwr_2307_i432582.trc
ORA-603 : opidrv aborting process LGWR ospid (2307_3545232)
Tue Feb 23 04:12:29 2010
Errors in file /oracle/diag/rdbms/cnoprpt01/CNOPRPT01/trace/CNOPRPT01_pmon_2287.trc:
ORA-00470: LGWR process terminated with error
PMON (ospid: 2287): terminating the instance due to error 470
Instance terminated by PMON, pid = 2287
Cause
======
I/O contention or slowness leads to control file enqueue timeout.
One particular situation that can be seen is LGWR timeout while waiting for control file enqueue, and the blocker is CKPT :
From the AWR:
1) high "log file parallel write" and "control file sequential read" waits
2) Very slow Tablespace I/O, Av Rd(ms) of 1000-4000 ms (when lower than 20 ms is acceptable)
3) very high %iowait : 98.57%.
4) confirmed IO peak during that time
Please note: Remote archive destination is also a possible cause. Networking issues can also cause this type of issue when a remote archive destination is in use for a standby database.
Solution
==========
Since the problem is caused by I/O contention/slowness, improving the I/O performance will avoid such issues.
It is an expected behavior that the control file enqueue requestor terminates itself with ORA-239 if the state machine failed to kill the blocker. In most cases, crashing the instance will prevent cluster-wide hang. This is the default setting.
--------------------------------------------------
There is a workaround to resolve the issue:
Disable kill blocker code via setting _kill_controlfile_enqueue_blocker=false.
- This setting already exists.
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
_kill_controlfile_enqueue_blocker boolean FALSE
or
@ To prevent BG blocker from being killed, you can set the following
@ init.ora parameter to 1 (default is 3).
@ .
@ _kill_enqueue_blocker=1
- This is already set to 1.
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
_kill_enqueue_blocker integer 1
Amit Rath, July 05, 2019 - 11:34 am UTC
Hi Tom,
Just asking a question about your sample test. Below is the test you mention:
so, let us say you had a session opened.
you updated 1,000,000 rows in a table
you have NOT committed.
now, you shutdown abort.
when we restart we will
a) apply the redo you generated during your 1,000,000 row update, replaying your 1,000,000 row update. The database now looks like it did right at the time of the "crash"
b) open the database - people can log in now
c) start rolling back your 1,000,000 row update. In fact, if someone wants to lock a row in the middle of your 1,000,000 rows - we can do a block level rollback and give them the appearance of instantaneous access (we'll roll back blocks on demand in this mode).
#########################
My question is: let's suppose I inserted 1,000,000 records and didn't commit; due to a space crunch in the buffer cache, my uncommitted data was written to the datafiles, and then my instance crashed.
Now in this case how will the roll forward happen?
Thanks
Amit Rath
July 08, 2019 - 3:05 am UTC
My question is: let's suppose I inserted 1,000,000 records and didn't commit; due to a space crunch in the buffer cache, my uncommitted data was written to the datafiles, and then my instance crashed.
We don't care if your data was in memory or on disk or anywhere else :-) All that matters is that we have the redo logs. The redo logs will contain
- all the information to roll the database forward
- that rolling forward brings the *undo* segments up to date as well, which can then be used to roll back any transaction.
The datafiles can be clean or dirty, it doesn't really matter - it just impacts the amount of I/O effort we'll need to do to clean out the mess
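You can actually see an estimate of that I/O effort at any time - if the instance crashed right now, this view shows roughly how much roll-forward work would be needed. A sketch, with column names as documented in 11g:

```sql
-- Estimated crash-recovery work for the running instance:
select recovery_estimated_ios,  -- data blocks to process during recovery
       actual_redo_blks,        -- redo blocks needed for recovery
       target_mttr,             -- effective fast_start_mttr_target (seconds)
       estimated_mttr           -- current estimated time to recover (seconds)
  from v$instance_recovery;
```

The dirtier the buffer cache (and the further back the last checkpoint), the larger these numbers get - which is exactly the "amount of I/O effort" Tom refers to.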