JDBC TIMEOUT and LOCKWAITS
Shyamakanta Chaudhury, February 20, 2008 - 5:45 am UTC
Many Thanks Tom.
I agree, the sequence uses noorder.
Please find below some database queries that usually encounter the lock and timeout problems.
The same queries are called from the TIBCO application for business processing.
alter table lalu storage (buffer_pool keep);
alter table lalu storage (buffer_pool default);
alter table lalu cache;
alter table lalu nocache;
DROP SEQUENCE ERECHARGE.E_RECHARGE_TRANS_SEQ;
CREATE SEQUENCE ERECHARGE.E_RECHARGE_TRANS_SEQ START WITH 1 INCREMENT BY 1 MINVALUE 1 CACHE 250 NOCYCLE NOORDER ;
1:SELECT *
FROM tab_dealer_money_master
WHERE dealer_id = TRIM (:b2) AND balance_type = 'CORE' AND master_id =
:b1
FOR UPDATE
Operation                        Object Name               Rows  Bytes  Cost  TQ  In/Out  PStart  PStop
SELECT STATEMENT Hint=CHOOSE                                  1            3
 FOR UPDATE
  TABLE ACCESS BY INDEX ROWID    TAB_DEALER_MONEY_MASTER      1     46     3
   INDEX UNIQUE SCAN             IDX_DEALER_MASTER            1            2
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        0      0.00       0.00          0          0          0           0
Execute    385      0.19       0.79          3       1548        781           0
Fetch      385      0.00       0.00          0          0          0         385
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      770      0.19       0.79          3       1548        781         385
Misses in library cache during parse: 0
Optimizer goal: CHOOSE
Parsing user id: 962 (recursive depth: 1)
Elapsed times include waiting on following events:
Event waited on                             Times   Max. Wait  Total Waited
----------------------------------------   Waited  ----------  ------------
global cache cr request                        129        0.11          0.49
global cache open x                             38        0.00          0.02
global cache null to x                          89        0.00          0.06
db file sequential read                          3        0.00          0.00
global cache s to x                              1        0.00          0.00
latch free                                       2        0.01          0.02
********************************************************************************
2:UPDATE tab_dealer_money_master
SET total_debit = NVL (total_debit, 0) + (NVL (:b5, 0) + NVL (:b4, 0)),
wip_debit = NVL (wip_debit, 0) - (NVL (:b5, 0) + NVL (:b4, 0)),
last_updated_on = SYSDATE
WHERE dealer_id = :b3
AND master_id = :b2
AND balance_type = DECODE (:b1,
'ETRC', 'CORE',
'PTRC', 'PROMO',
'INVALID'
)
Operation                        Object Name               Rows  Bytes  Cost  TQ  In/Out  PStart  PStop
UPDATE STATEMENT Hint=CHOOSE                                  1            3
 UPDATE                          TAB_DEALER_MONEY_MASTER
  INDEX UNIQUE SCAN              IDX_DEALER_MASTER            1     38     2
3:INSERT INTO tab_e_recharge_trans
(trans_id, trans_date, trans_type, donor_master_id, donor_mdn,
recharge_amount, recepient_mdn, recepient_masterid,
talktime_given, out_status, status_in_update, last_updated_on,
curr_status, cos_code, validity_days, core_bucket_name,
external_ref, source_id, dealer_id, sec_cos,
hst_type, dffrntl_amt, recipient_circle_abbrev
)
VALUES (:b21, SYSDATE, 'ETRC', :b20, :b19,
:b18, :b17, :b16,
:b15, :b14, NULL, SYSDATE,
:b13, :b12, TRIM (:b11), TRIM (:b10),
TRIM (:b9), NVL (TRIM (:b8), :b7), NVL (:b6, TRIM (:b5)), :b4,
TRIM (:b3), :b2, :b1
)
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        0      0.00       0.00          0          0          0           0
Execute    386      1.20      47.49       1122        816      16028         386
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      386      1.20      47.49       1122        816      16028         386
Misses in library cache during parse: 0
Optimizer goal: CHOOSE
Parsing user id: 962 (recursive depth: 1)
Elapsed times include waiting on following events:
Event waited on                             Times   Max. Wait  Total Waited
----------------------------------------   Waited  ----------  ------------
db file sequential read                       1122        1.03         23.55
buffer busy global cache                       121        0.25          4.58
buffer deadlock                                139        0.00          0.00
buffer busy waits                              195        0.75          9.84
global cache open x                            736        0.00          0.40
enqueue                                         33        0.49          2.71
global cache null to x                         141        0.27          2.93
global cache cr request                         56        0.00          0.04
global cache s to x                             33        0.12          0.14
global cache busy                               29        0.24          1.82
latch free                                      23        0.01          0.07
global cache open s                             11        0.00          0.00
log file sync                                    3        0.14          0.15
global cache null to s                           1        0.00          0.00
********************************************************************************
And there are around 20 million inserts into the TAB_E_RECHARGE_TRANS table.
global cache cr request shows a high value in the trace.
Please help me out with tuning the database, taking the above queries into consideration.
Thanks.
Shyamakanta.
February 20, 2008 - 9:09 am UTC
what is up with this:
alter table lalu storage (buffer_pool keep);
alter table lalu storage (buffer_pool default);
alter table lalu cache;
alter table lalu nocache;
can't you make up your mind?
1) that select doesn't seem to be a possible cause of any timeout issues.
ops$tkyte%ORA10GR2> select .79/385 from dual;
.79/385
----------
.002051948
each individual select was about 2/1000th of a second.
2) no comment, nothing to look at...
3) IO is the big guy here
ops$tkyte%ORA10GR2> select 23.55/1122 from dual;
23.55/1122
----------
.020989305
could be considered "not exceptionally fast" - however, each insert took about
ops$tkyte%ORA10GR2> select 47.49 / 385 from dual;
47.49/385
----------
.123350649
but most (50%) of that time is spent on IO.
Looks like you might have lots of indexes on that table you are inserting into - we need the leaf blocks in memory in order to insert - and that is the physical IO (guessing; you'd need to research that, v$segment_statistics might be useful). Hence, it is not the table that is causing the IO, but the index maintenance.
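For example - a sketch, assuming the objects live in the ERECHARGE schema as the sequence DDL above suggests - you could ask v$segment_statistics whether it is the indexes taking the physical reads:

select s.object_name, s.statistic_name, s.value
  from v$segment_statistics s
 where s.owner = 'ERECHARGE'
   and ( s.object_name = 'TAB_E_RECHARGE_TRANS'
      or s.object_name in ( select index_name
                              from dba_indexes
                             where owner = 'ERECHARGE'
                               and table_name = 'TAB_E_RECHARGE_TRANS' ) )
   and s.statistic_name in ('physical reads', 'buffer busy waits')
 order by s.value desc;

If the index segments dominate, that supports the index-maintenance theory.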
"LOCKWAITS AND TIME OUT (FROM TIBCO APPLICATION)
Shyamakanta Chaudhury, February 20, 2008 - 12:41 pm UTC
Hello Tom,
Many thanks for the update. To be honest, your suggestions make much sense to me.
I have one last question for you.
While I had a single instance (SUN E6900), everything was running well and good.
I rarely used to get these lock waits (once a day) and absolutely no JDBC TASK TIMEOUT errors.
The CPU used to be 70% free during the busy working hours.
So I guess there is some abnormality/incompatibility with the application after the RAC implementation.
Also, there is a high value for "global cache cr request" in the SQL trace from the TIBCO session.
The initial issues came up with the sequence generation process. We encountered lock waits on sequence-generation SQL statements on table TAB_E_RECHARGE_TRANS.
As per your opinion of high IO in the system, will it be good to cache the tables/indexes?
Thanks.
Shyamakanta.
February 20, 2008 - 1:15 pm UTC
... So i guess there is some abnormality/compatibility with the application after
the RAC implementation. ...
or you are doing more users
or there was an application change
or your IO is not double buffered by the OS anymore
I would go for the last one, come to think of it. I'll bet that before, your files were regular files on a regular (BUFFERED) file system. So, in the past you were always doing as many physical IOs as you are now - but they were read from the OS file system cache. Now, because you have to be using non-buffered IO devices, you are not.
EG: you were using the OS as a secondary SGA in the past. Now you are not.
Which would imply "you need to relook at your buffer cache settings, they might need to increase to be as large as they were PLUS however much of the OS memory was being used for a file system cache before"
see for example:
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:7413988573867 shows how 13k IOs can go from taking many many seconds to taking one second - the effect of the OS file system cache.
And if that is what is happening here, you need to look at your SGA sizing, it might have to increase to be at the same level of data caching as it was before (only now oracle is doing all of the caching)
the global cache request stuff is such a blip on the radar screen compared to your IO right now, it is *noise*. The value is not too high, the value is what it is.
Use statspack or AWR and the buffer cache advisor to see what it says - to see what the optimal cache size would be (and that works best if you just have the default cache, you complicate it with keep and recycle pools)
And let things cache themselves - if you use them, we'll cache them. If you don't, we won't.
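A sketch of reading the advisor output (assuming the default pool and an 8k block size - adjust to yours):

select size_for_estimate, buffers_for_estimate,
       estd_physical_read_factor, estd_physical_reads
  from v$db_cache_advice
 where name = 'DEFAULT'
   and block_size = 8192
   and advice_status = 'ON'
 order by size_for_estimate;

Look for the size_for_estimate at which estd_physical_read_factor stops dropping meaningfully - beyond that point, more cache buys little.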
Review doc
Bala Palani, February 20, 2008 - 2:24 pm UTC
February 20, 2008 - 2:43 pm UTC
ugh - do you see that it is at the level of *noise* right now? and the relative impact per query is nominal?
My suggestion is based on..
A reader, February 21, 2008 - 12:50 pm UTC
Tom,
My suggestion is based on:
1. The trace file posted by the user is not complete.
2. "While i had a single instance(SUN E6900), everything was running well and good."
3. "there is a high value for "global cache cr request" in the sql trace from the TIBCO session."
4. We can't expect any OLTP application to scale, without changes, when moving to RAC.
From the Oracle doc:
http://download.oracle.com/docs/cd/B10501_01/rac.920/a96598/design.htm#1024997
"For example, consider a table that has a high insert rate which also uses a sequence number as the primary key of its index. All sessions on all nodes access the right-most index leaf block. Therefore, unavoidable index block splits can create a serialization point that results in a bottleneck. To resolve this, rebuild the table and its index, for example, as a 16-way hash partitioned object. This evenly distributes the load among 16 index leaf blocks."
I do notice the disclaimer :)
"These techniques also apply to single-instance environments; they are not specific to Real Application Clusters."
However, there will be effects of cache fusion, of course much better than the 8i OPS block pinging...
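Purely as an illustration of the doc's suggestion applied to the table in this thread (index name hypothetical; note that global hash-partitioned indexes need 10g - on 9i you would hash partition the table itself and use a local index):

create index erecharge.idx_ert_trans_id
    on erecharge.tab_e_recharge_trans (trans_id)
global partition by hash (trans_id) partitions 16;

That spreads the sequence-driven right-hand-side inserts over 16 leaf-block "hot spots" instead of one.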
February 21, 2008 - 4:55 pm UTC
and given that the global cr stuff is noise, there are infinitely larger fish to fry based on given information.
I'll quote from earlier in this thread:
...
... So i guess there is some abnormality/compatibility with the application after
the RAC implementation. ...
or you are doing more users
or there was an application change
or your IO is not double buffered by the OS anymore
I would go for the last one, come to think of it. I'll bet that before, your files were regular files on a regular (BUFFERED) file system. So, in the past you were always doing as many physical IOs as you are now - but they were read from the OS file system cache. Now, because you have to be using non-buffered IO devices, you are not.
EG: you were using the OS as a secondary SGA in the past. Now you are not.
Which would imply "you need to relook at your buffer cache settings, they might need to increase to be as large as they were PLUS however much of the OS memory was being used for a file system cache before"
.......
LOCKWAITS AND TIME OUT (FROM TIBCO APPLICATION)
Shyamakanta Chaudhury, February 23, 2008 - 7:51 am UTC
Hello Tom,
Some quick answers to your questions:
1: There has been neither an application change nor a load/user increase on the database.
2: The database filesystems are on VCFS on top of VxVM.
3: I had a chat with the system admin/network person and
found that "chip port utilization for both servers is 45%".
4: I am told that 'There is no file system buffering in VCFS'.
5: Your suggestion of reducing the less-used indexes is really impressive and we are planning for the same.
6: Moving to table partitioning is not possible at this point in time.
Thanks.
Shyamakanta.
February 24, 2008 - 11:46 am UTC
3: what the heck is "chip port utilization"?
4: I know, I told you that :)
did you get my point about BUFFER CACHE LIKELY NEEDS TO BE LARGER as you USED TO USE the OS to double buffer and now you are not.
LOCKWAITS AND TIME OUT (FROM TIBCO APPLICATION)
A reader, February 27, 2008 - 7:35 am UTC
Hi Tom,
While on a single instance the SGA was 4 GB; after RAC it was increased to 14 GB. So also the shared pool, from 700 MB to 1 GB.
Do I need to increase it more?
Thanks.
Shyamakanta.
February 27, 2008 - 11:15 am UTC
i'm only interested in the buffer cache.
what was the total size of ram in the system before rac. what was the buffer cache before.
what is the total size of ram on a single node after rac. what is the buffer cache in a single node after rac.
are your IO times "increased from what they used to be", do you see physical IO taking longer now.
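For that last question, a sketch - average read time per data file from the instance counters (readtim is in centiseconds, so multiply by 10 for milliseconds):

select f.name, s.phyrds,
       round(s.readtim * 10 / nullif(s.phyrds, 0), 2) avg_read_ms
  from v$filestat s, v$datafile f
 where s.file# = f.file#
 order by 3 desc;

Compare these against whatever numbers you kept from before the migration.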
LOCKWAITS AND TIME OUT (FROM TIBCO APPLICATION)
Shyamakanta Chaudhury, February 28, 2008 - 6:48 am UTC
Hi Tom,
The total RAM is 96 GB on node1 and the cache size was 4 GB before RAC.
Total RAM on node2 is 48 GB.
After migration to RAC, the database came to a standstill with the same buffer cache, so it was increased to 12 GB.
No, I don't find any abnormality with physical IO. Any method to check?
Thanks.
Shyamakanta.
February 28, 2008 - 3:49 pm UTC
so the original 'buffer cache' before rac was 96gb. (the OS buffered data, the SGA buffered data)
now it is much less, less than 14gb.
do you see what I'm saying here?
I'm not saying you'll find anything abnormal about the IO - just that before, it was probably 'faster', since many/most of the IOs were satisfied out of the OS file system cache, and you've made that go away with RAC.
do you have metrics that show the original physical IO avg response times? Have they diverged from the CURRENT physical IO response times.
LOCKWAITS AND TIME OUT (FROM TIBCO APPLICATION)
A reader, March 08, 2008 - 6:06 am UTC
Hi Tom,
The buffer cache was not 96 GB, that was the total RAM.
The buffer cache was less than 5 GB.
As per your suggestion, I found a bottleneck in the filesystem/storage. The response time is quite high on some disks. Now I am in the process of moving the heavily used tables and indexes onto less-used disks.
Thanks.
Shyamakanta.
March 10, 2008 - 11:21 am UTC
you are missing my subtle point
before the buffer cache was effectively 96gb - because the OS double buffered - you had a secondary SGA. Your physical IO's were not true PHYSICAL IOS - they were from the file system buffer cache - hence your IO's were "apparently faster" before - because of this double buffering.
actually, i was not subtle, I sort of said that:
...
so the original 'buffer cache' before rac was 96gb. (the OS buffered data, the SGA buffered data)
.....
LOCKWAITS AND TIME OUT (FROM TIBCO APPLICATION)
Shyamakanta, March 10, 2008 - 12:13 pm UTC
Hi Tom,
We found high response times, in some cases > 100 ms, whereas < 20 ms is good as per the Oracle recommendation.
I distributed the load and moved the highly used table and its indexes to new mount points/disks.
Please find the comments from oracle support on SR-18582136.6
"The data collected from the statspack shows that there is some major IO issue.
Look at the Avg Rd(ms) for the tablespaces pasted above. The avg rd is too high.
The threshold that is set by oracle is 20 avg rd(ms). The value is so high in your case
that it is showing #### there."
After the load distribution, the average response time for these idle mount points got increased leaving very less benefit.
Please find the response from the datacenter and storage vendor:
"We have already logged the case with HP. And HP has verified that there is no problem on the storage.
Service time for disks will be high due to high read/write operations on the disks.
root@P0111CRMDB #
                        OPERATIONS                  BLOCKS           AVG TIME(ms)
TYP NAME              READ       WRITE         READ         WRITE    READ  WRITE
vol tibco           900305     2897992     10648178      15534475     6.4    7.9
vol u03data      692570107  1305885726   2547348992    3810053982     3.5   32.0
vol u14data      543040897   320617578   3813579816       1570613    20.1   31.3
vol u15index     582783628   752923602   3449427870    3631674129    18.4   47.6
vol u22          270133744   257056932   3051927401    4249480253    23.8   44.1
vol u23             637332      392378     16431257       6607922    19.4   15.8
vol u24           29258500    67747156    771394297    1111616380    20.1   37.1
vol u25              31030      279456       217177       4550686     4.5    8.0
vol vol01           386831    12054378     13851968    3368654704    10.7   46.6
vol vol02        869906774   445868414   3567592400    2859980786    16.6   40.6
vol vol03        151841001   272585961   2824709440     122360218    19.5   54.1
"
I have planned to increase the db buffer size.
Some quick questions for your attention:
i) Taking into consideration the high response times from the disks, will it be beneficial to cache the table tab_dealer_money_master (700 MB) and its indexes in the database buffer?
It's a read-intensive table.
ii) What about caching the insert-intensive table tab_e_recharge_trans in the buffer cache?
iii) The current log_buffer is 16384000 - is that good for the current scenario?
Thanks.
Shyamakanta.
March 10, 2008 - 12:20 pm UTC
1) that will happen NATURALLY - the buffer cache, when given the space, caches that which makes sense to cache.
2) see #1
3) totally not relevant, you are not waiting on the log buffer, nothing here to indicate that you are anyway.
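(If you want to verify that for yourself, a quick sketch - the system-wide waits that would implicate the log buffer, time_waited in centiseconds:

select event, total_waits, time_waited
  from v$system_event
 where event in ('log buffer space', 'log file sync');

If 'log buffer space' is at or near zero, the log_buffer setting is a non-issue.)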
96 GB RAM-should buffer_cache be that high?
suresh, March 11, 2008 - 3:58 am UTC
Dear Tom,
I find your response here to be 'at variance' - maybe diametrically opposite - to your usual views on tuning.
The original poster had a really 'good' box earlier - 96 GB RAM - which the O/S was buffering. Now, with RAC, he has lost that benefit, and all physical I/Os as reported by Oracle are real physical I/Os. This much makes sense.
However, in your solution you are asking him to increase the buffer cache to what it was before, which was very high to start with.. :) Your usual thought on tuning is to see how to reduce the blocks touched; here you are advocating a very different approach!
March 11, 2008 - 7:18 am UTC
No, I did not say put the buffer cache at 96gb.
I said - BEFORE RAC, you in effect had a cache upto 96gb. You now do not. Please take that INFORMATION into consideration, you might have to increase your true buffer cache to account for the missing SECONDARY CACHE you used to have.
I did not say "make it 96gb", I said "before it could have been in effect 96gb, please bear that in mind, you might have to increase yours".
They were trying to get back to where they were.
I do not say "do not have big caches".
I do say "do not increase the cache willy nilly and just dream of super performance - it doesn't work that way"
LOCKWAITS AND TIME OUT (FROM TIBCO APPLICATION)
A reader, March 20, 2008 - 8:38 am UTC
Dear Tom,
I made some changes in the environment, and the outcome is something like below:
1: Removed the failover and load balancing from the RAC environment. Some applications are routed to instance1 and some to instance2.
To my surprise, the lock waits during inserts into tab_e_recharge_trans or selects from tab_dealer_money_master have come down significantly.
Moreover, JDBC timeouts have also come down significantly.
But at this point in time I can't enjoy the RAC failover feature.
Thanks.
Shyamakanta.
March 24, 2008 - 10:26 am UTC
and lo and behold, you still have no idea what you are waiting on - why things are slow, you don't even know if they are significantly faster or if the change you made had anything to do with it at all.
I'll keep suggesting: find out what you are waiting on
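One sketch of how, on 10g - trace a representative TIBCO session with wait events (pick sid/serial# from v$session):

begin
   dbms_monitor.session_trace_enable
   ( session_id => :sid,
     serial_num => :serial,
     waits      => true,
     binds      => false );
end;
/

then run the resulting trace file through tkprof and read the wait event summary, exactly as was done for the traces earlier in this thread.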
JDBC TASK TIMEOUT.
lalu121212, July 29, 2008 - 1:08 am UTC
Hi Tom,
We carried out some exercises and got much benefit:
1: Packages/procedures that are called frequently, and those called when the timeout occurs, are kept in the shared pool (dbms_shared_pool.keep; see the sketch after this list).
2: One more application engine was introduced so that requests are spread out.
And now the error has vanished. So maybe this was an application issue, but it is still not tested on RAC as the application is routed to one single node.
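For reference, the keep call we used looked roughly like this (package name hypothetical; dbms_shared_pool is installed via dbmspool.sql):

exec dbms_shared_pool.keep( 'ERECHARGE.PKG_RECHARGE', 'P' );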
One of the reviewers commented something like "The Original Poster had a really 'good' box earlier 96GB RAM which was O/S buffering. Now, with RAC, he has lost that benefit, and all physical I/O as reported by oracle are real physical I/Os."
Is it the default that OS buffering is present for a single node and absent in a RAC system?
Can you highlight more about it?
Thanks.
July 29, 2008 - 3:48 pm UTC
1) you do not need to keep something you call frequently, that is what a cache DOES, caches frequently used stuff...
you use a different type of file system with rac. An unbuffered one. Typically, many people not using rac just use a plain old file system - they are buffered.
When they went rac, they "did not account for the fact they used this secondary cache, they had to account for that by increasing the SGA appropriately"
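As an aside - a sketch of checking whether Oracle itself is requesting unbuffered (direct) IO on a file system, from SQL*Plus:

show parameter filesystemio_options

A value of directIO or setall means Oracle bypasses the OS file system cache where the platform supports it; none means the OS is free to double buffer.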
RAC and Distributed database.
lalu., June 12, 2009 - 6:18 am UTC
Hi Tom,
It's related to a data/load increase in a RAC system.
We have a 2-node RAC system running a small business over a small geographical region.
Currently we are experiencing high response times as we have added some more applications to use the same database.
The application team wants to introduce a distributed concept and do an application segregation. Some of the tables or objects are used by multiple applications, so there is a need for data synchronization (Oracle Advanced Replication or Streams) with the different sites.
As per my thinking, since we are dealing with a small geographical region, adding one more node to the RAC is a better option than introducing a distributed database.
We can avoid extra storage costs and the data synchronization overhead.
What are the ideal situations to use distributed databases?
Thanks.....
lalu.......
June 12, 2009 - 10:13 am UTC
if you think it is slow now - go for distributed transactions and queries and synchronization. That'll be a sure way to make it even worse
How about we start with IDENTIFYING WHAT THE BOTTLENECK IS first - and then apply corrective action.
It might not be adding another node, it might be correcting a hugely inefficient algorithm.
but you have to start with "what is wrong" before you can "apply a fix"
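A sketch of a starting point - the top non-idle waits instance-wide (wait_class and time_waited_micro need 10g):

select *
  from ( select event, total_waits,
                round(time_waited_micro / 1e6) seconds_waited
           from v$system_event
          where wait_class <> 'Idle'
          order by time_waited_micro desc )
 where rownum <= 10;

Statspack or AWR gives the same picture over a defined interval, which is usually more useful than since-startup totals.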