Thanks for the question, Ashish.

Asked: August 04, 2005 - 11:17 am UTC

Last updated: August 20, 2005 - 3:30 pm UTC

Version: 9.2.0.5

Viewed 1000+ times

You Asked

Hi Tom,

We are using DBMS_JOB to achieve multiprocessing. We submit 9 jobs simultaneously. This code ran absolutely fine in production for 15 days and then suddenly one day 2 jobs got suspended and didn't execute at all for 6hrs. We restarted the database server and the jobs resumed execution. Subsequently this problem has come up 3 more times at random intervals.

The jobs were not broken also.

Upon checking the active session sql, we found following active statements
DECLARE job BINARY_INTEGER := :job; next_date DATE := :mydate; broken BOOLEAN := FALSE; BEGIN CS_DOWNLOAD.SPUPLOADRECEIPTS(0,JOB,274); :mydate := next_date; IF broken THEN :b := 1; ELSE :b := 0; END IF; END;

DECLARE job BINARY_INTEGER := :job; next_date DATE := :mydate; broken BOOLEAN := FALSE; BEGIN CS_DOWNLOAD.SPDOWNLOADBNK(0,JOB,274); :mydate := next_date; IF broken THEN :b := 1; ELSE :b := 0; END IF; END;

We've logged the TAR with metalink, but their response is that its an application design issue, as it may be creating library latches.

What could be the issue and where to start looking for the bottleneck? I am not sure, how it could be an issue with application, when the 2 jobs are completely independent and access different tables.

Any pointers from you will be of great help. We have trace files generated from oracle hanganalysis (suggested by metalink), which I can send across to you, if you are willing to take this question further.

Thanks

and Tom said...

If you found that sql "active" then - the jobs where NOT suspended, the jobs where in fact *executing*.

Your procedures where in fact running -- perhaps they had logic issues that caused them to get blocked in some fashion -- but the problem here doesn't seem to be the job queues (if they were active, they were not suspended).

You should analyze them to see what they are doing when they get "stuck", what are they waiting on (they must be waiting for something) or are they running away in an infinite loop. Do you use dbms_application_info to set_client_info or better yet, set_session_longops to be able to monitor them? Are these traceable (eg: have you instrumented your code? Is your code "debugable" in the fashion Oracle is with traceing and events and such?)

Rating

(10 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Ashish, August 05, 2005 - 12:01 am UTC

Thanks Tom for the prompt response.
The first line in the procedures is dbms_application_info.set_client_info. But its not visible at all. We are also using instrumentation code (we insert into a table)

I've understood your point. It has nothing to do with JOBS. The procedures are lined up for execution but are waiting for something. How to find it? The package has procedures which insert data from db-link into different tables.
mentioned below is some info by metalink. I couldn't understand much of it

The systemstate shows:-

PROCESS 23:-

waiting for 'library cache pin' blocking sess=0x0 seq=2356 wait_time=0
handle address=19539ff0, pin address=17739cc4, 100*mode+namespace=12e
temporary object counter: 0

SO: 17739CC4, type: 52, owner: 184601B4, flag: INIT/-/-/0x00
LIBRARY OBJECT PIN: pin=17739cc4 handle=19539ff0 request=X lock=0
user=16991510 session=16991510 count=0 mask=0000 savepoint=155 flags=[00]

This is held by process 25 (image: ORACLE.EXE (J003)):-

SO: 1B0A6EE8, type: 52, owner: 17186CB0, flag: INIT/-/-/0x00
LIBRARY OBJECT PIN: pin=1b0a6ee8 handle=19539ff0 mode=S lock=17738254
user=1715e9b4 session=1715e9b4 count=4 mask=0011 savepoint=151 flags=[00]

This process is on the CPU at this time (so will not release the resource until it has completed its CPU cycle)

last wait for 'latch free' blocking sess=0x0 seq=9570 wait_time=3
address=17309f78, number=62, tries=0
temporary object counter: 0

The handle is name=WEBC_DATA.CS_DOWNLOAD

The problem is that the package is INVALID (in the context of the shared pool):-

LIBRARY OBJECT: object=1953bc54
type=PKBD flags=EXS/LOC[0005] pflags=NST [01] status=INVL load=0
DEPENDENCIES: count=84 size=96

I can see SQL such as update tab$, update obj$ being called in this session - which co
ncerns me a little as this would cause an invalidation (it explains the invalida
tion but it should not hang)

So - process 25 is doing something to revalidate the package -if it is spinning of the CPU during that time then it may be related
to the fact we will having to contact the remote end via the DBLINK.

We, therefore , need to do the following (next time this happens) - there is not enough e
vidence to conclude anything from this systemstate:-

1) Dump 2 systemstates (do one, leave it for 10 seconds and then do another) and edit it and search for 'l
ast wait' - you should find the line is similar to the 'last wait' line above. M
ove up from the line until you see :-

PROCESS XX :
------------------------------------------------------------------------------------------------------------
-----------------

This is the PROCESS we are interested in. Search for the next occurance of this in the systemstate (so 'find' PROCESS XX) and check down to t
he wait line for that process (it will be in the same place relative to the PROC
ESS line). Make sure it says 'waiting' too (this tells us that the PROCESS has n
ot really move since the first systemstate.

Now, make a note of the OSPID of this process. To do this, about 15 lines down from the PROCESS line is a line like
:-

O/S info: user: SYSTEM, term: COLCHACWP0, ospid: 2052

So - in this case it is OSPID 2052

Now, start a sqlplus session as sysdba then:-

oradebug setospid X <--where X is the ospid of the hanging process
oradebug dump errorstack 3
(do this last command 3 times, one straight after the other)

This will write a trace file to user_dump_dest which will tell us what this process is doing.
-----------------------------------------------------------
The problem that we have is that we cannot reproduce it at will, as it occurs very randomly.

Thanks

August 05, 2005 - 10:45 am UTC

...
So - process 25 is doing something to revalidate the package -if it is spinning
of the CPU during that time then it may be related
to the fact we will having to contact the remote end via the DBLINK.....

so, is this package refering to remote objects and was the dblink in fact not avaiable?

similar experiences

Connor, August 05, 2005 - 12:45 am UTC

jobs would mysteriously cease to run when their scheduled time came up. Typically resolved by altering job_queue_processes back down to 0 and back up to previous setting.

Has settled down since we've been on the later 9.2 patchsets

hth
Connor

August 05, 2005 - 10:45 am UTC

(but the job is RUNNING here, otherwise i would agree)

Is network involved

Andrew Allen, August 05, 2005 - 9:07 am UTC

We have several jobs that perform m-view refreshes across database links. If, while the job is running the network drops or the remote database goes down, sometimes the local job stops running -- as if it is waiting for a response from the remote database that never comes. We solve this by setting the job queue processes to 0, killing the session (sometimes by killing the linux process), and then restarting the job queue and demanding the job again.

From the text of the job in question (...DOWNLOAD...), you may be having a similar problem.

Ashish, August 05, 2005 - 10:37 am UTC

Thanks Andrew and Connor.

Will try to simulate this by dropping the connection to remote database.

Hi Tom, Could you pls provide your feedback on Andrew and Connor's observations, in addition to your view?

Thanks

August 05, 2005 - 1:52 pm UTC

given your mention of dblinks (previously un-mentioned), it could likely play a role.

Fact:

jobs are not suspended, they are in fact *running*

So, it would not be what Connor mentioned, although if it is that they are stuck on a dead dblink (tcp-ip is not raising any error but the other end of the network connection is 'gone') the stopping of these jobs and restarting of them would shake it loose -- it would explain why "bouncing the instance" cleans them out

Ashish, August 05, 2005 - 2:29 pm UTC

Hi Tom,

We are using db-link to transfer data from a remote database. We never thought that network could be an issue because one of the procedures, not dependent on remote object also got suspended (all the procedures whether they connect to remote database or not belong to the same package CS_DOWNLOAD). But from today's discussion, I have got a direction. Will try to setup the testcase in test environment and will update if I manage to reproduce the issue.
But don't you think, this is something, which the database should manage itself?

Thanks

August 05, 2005 - 5:35 pm UTC

but there is also the talk of something going invalid and remote dependenceis and such.

The network, if it "stops responding" is beyond our control, if tcp-ip doesn't tell us "dead guy on the other end", well.

A reader, August 05, 2005 - 10:48 pm UTC

What does "INVALID (in the context of the shared pool)" ?

Does that mean that the package loaded in shared pool is invalid but when I query the status through USER_OBJECTS, it may display the status as "VALID"?

Is there a way to identify the INVALID packages/procedures in SHARED POOL.

August 06, 2005 - 9:26 am UTC

this is a package going through the process of getting invalidated, but it cannot because it is running and must wait for the already running copy to finish -- but new copies of it trying to run are blocked.

You might be able to just look at DBA_OBJECTS to see its status.

Metalink update

Ashish, August 18, 2005 - 1:23 am UTC

Hi Tom,
Just wanted to share the metalink update with you and to seek few conceptual clarifications.
<METALINK>
17-AUG-05 11:16:07 GMT
Ok - the systemstate shows PROCESS 17 to be one the CPU:-

last wait for 'latch free' blocking sess=0x0 seq=36011 wait_time=2
address=3331c968, number=62, tries=0
temporary object counter: 0

This is ospid: 2304, so the errorstack should tell me something...

PROCESS 17 is running SELECT COUNT(*) FROM JOB_PARAMETERS continuously (like , millions of times!)

Stack is:-

_ksdxcb+626 CALLreg 00000000 699B890 11 3 699B840 699B93C
_ssthreadruncallbac CALLrel _ksdxcb+0
k+53
..1.16_4.filter.81+ CALLrel _kdstgr+0 E3D1F64
578
_qergsFetch+379 CALL??? 00000000
..1.3_1.filter.30+5 CALL??? 00000000 36B1E588 554CA8 699BE38 2
62
_opifch+76 CALLrel _opifch2+0 89 5 699BF70
_opiefn0+314 CALLrel _opifch+0 4E 6 699C270
_opipls+2338 CALLrel _opiefn0+0 4E 6 699C270 0 0 0 0 0 0 0
..1.1_1.filter.34+1 CALLreg 00000000 66 6 699CA9C
356
_rpidrus.43+153 CALLrel _opiodr+0 66 6 699CA9C 42
_skgmstack+113 CALLreg 00000000 699C5F4

It is called off here:-

----- PL/SQL Call Stack -----
object line object
handle number name
36BC8228 201 procedure WEBC_DATA.SPCSBULKDOWNLOAD
36BD5AB0 1 anonymous block

Hi,

Ok - You need to look at WEBC_DATA.SPCSBULKDOWNLOAD specifically the call to select count(*) from job_parameters
and why is is running it continuously - if you look at the trace file you sent
me, it contains 1.6Gb of lines which are just continuous calls of this SQL. It
looks to be that some logic in the code in your app is the problem. I would sugg
est, if it is designed to spin continuously in the SQL call then at least some f
orm of small sleep is put in there as there is no need to run it hundreds of tim
es per second.

To give you an idea of the number of times it is called. I split the trace file up into 360M chunks. in the last of them I saw this SQL was run
1896909 times.
</METALINK>

To provide you the background, I followed the strategy as listed in thread ("Spawn Jobs from a Procedure that run in parallel")
</code> http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:10498431232211 <code>

But I forgot to add dbms_lock.sleep. From the update above, it seems it was a costly mistake. But why is dbms_lock.sleep required at all. Why not simply call

loop
select count(*) into l_cnt from job_parameters;
exit when (l_cnt = 0);
end loop;

How can and Why should calling it a million times lead to library latch contention? Wouldn't this query always be soft parsed?
Looking forward to your feedback to the questions asked and to as always many implicit questions, which we may not even be aware of.

Thanks

August 18, 2005 - 3:56 pm UTC

you are killing the system by executing that query billions of times. the sleep was there so we can benchmark *something else*, not benchmark how many times we can call this query.

Ashish, August 19, 2005 - 12:56 am UTC

I agree its running it a billion times.
I agree with you in principal but few doubts still remain
Given the fact that
- Table is very small and is cached (maximum of 9 rows at any point of time. 2 column table)
- Query is always soft parsed. Logical I/O and physical I/O are zero
- The Code is Sequential i.e the second iteration will not commence before the completion of first iteration.
Why should it then hog so much resources?

I know, it would be difficult for you to comment in absence of trace files. Just taking a chance, if you can provide some information.

August 20, 2005 - 3:30 pm UTC

logical io cannot be zero.  That would be impossible here if job_parameters is a real table.

logical io consumes latches (lots of them).  parsing consumes latches.  


ops$tkyte@ORA10G> create table t ( x int );
 
Table created.
 
ops$tkyte@ORA10G> insert into t values ( 1 );
 
1 row created.
 
ops$tkyte@ORA10G> commit;
 
Commit complete.
 
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> exec runstats_pkg.rs_start;
 
PL/SQL procedure successfully completed.
 
ops$tkyte@ORA10G> declare
  2          l_cnt    number;
  3  begin
  4          for i in 1 .. 100000
  5          loop
  6                  select count(*) into l_cnt from t;
  7          end loop;
  8  end;
  9  /
 
PL/SQL procedure successfully completed.
 
ops$tkyte@ORA10G> exec runstats_pkg.rs_middle
 
PL/SQL procedure successfully completed.
 
ops$tkyte@ORA10G> begin
  2          null;
  3  end;
  4  /
 
PL/SQL procedure successfully completed.
 
ops$tkyte@ORA10G> exec runstats_pkg.rs_stop(10000)
Run1 ran in 680 hsecs
Run2 ran in 10 hsecs
run 1 ran in 6800% of the time
 
Name                                  Run1        Run2        Diff
STAT...table scans (short tabl     100,001           0    -100,001
STAT...table scan rows gotten      100,001           0    -100,001
STAT...execute count               100,007           5    -100,002
STAT...recursive calls             100,022           2    -100,020
LATCH.shared pool                  100,155          80    -100,075
LATCH.library cache pin            200,087          45    -200,042
LATCH.library cache                200,166          84    -200,082
STAT...calls to get snapshot s     300,006           1    -300,005
STAT...table scan blocks gotte     500,005           0    -500,005
STAT...no work - consistent re     500,006           0    -500,006
STAT...session logical reads       700,032          32    -700,000
STAT...consistent gets             700,018          11    -700,007
STAT...consistent gets from ca     700,018          11    -700,007
LATCH.cache buffers chains       1,400,103         109  -1,399,994
 
Run1 latches total versus runs -- difference and pct
Run1        Run2        Diff       Pct
1,901,229         391  -1,900,838##########%
 
PL/SQL procedure successfully completed.


See, quite a bit of work to do that.

Thanks

Ashish, August 21, 2005 - 11:36 pm UTC

Thanks Tom, for very lucid explaination.

I've had similar problems

Phil Singer, September 06, 2005 - 3:46 pm UTC

DBMS_JOBS, Version 9.2 Jobs will work for months, and then suddenly spin forever until killed and restarted (following the procedure in the DBA Administrator's guide). This seems to happen when either the SMTP service is down (three of them send mail) or the database link is dead (two query a database link (one does both)). Have not been able to detect any reasons why this would happen; keep hoping this will happen on a Saturday morning, so I can spend lots of time troubleshooting. Jobs remain 'active' (but burning lots of CPU)

Breadcrumb

Question and Answer

You Asked

and Tom said...

Rating

Comments

similar experiences

Is network involved

Metalink update

Thanks

I've had similar problems

More to Explore

PL/SQL demos

PL/SQL docs

Some of the DBMS_JOB jobs remain in suspended stage

Breadcrumb

Question and Answer

You Asked

and Tom said...

Rating

Comments

similar experiences

Is network involved

Metalink update

Thanks

I've had similar problems

More to Explore

PL/SQL demos

PL/SQL docs