Sleep count
Michel CADOT, September 07, 2007 - 2:06 am UTC
Shouldn't the line
Sleeps on Latch = Sleeps + 1
come after the spinning loop, just before or after the line
Go to sleep for short period
That is, does it count the number of spins or the number of real sleeps (or maybe a spin is a "sleep")?
Regards
Michel
September 11, 2007 - 8:50 am UTC
you are correct, not sure how the sleep increment got before the loop
thanks:
Attempt to get Latch
If Latch gotten
Then
    return SUCCESS
Else
    Misses on that Latch = Misses + 1;
    Loop
        For I in 1 .. 2000
        Loop
            Attempt to get Latch
            If Latch gotten
            Then
                Return SUCCESS
            End if
        End loop
        Sleeps on Latch = Sleeps + 1
        Go to sleep for short period
    End loop
End if
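These counters surface directly in the v$latch view; a minimal sketch of reading them (assuming SELECT access on the v$ views):
-- cumulative (since instance startup) latch statistics;
-- MISSES, SLEEPS and SPIN_GETS map to the counters in the
-- pseudo-code above (SPIN_GETS = got the latch while spinning,
-- without having to sleep)
select name, gets, misses, spin_gets, sleeps
  from v$latch
 order by sleeps desc;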
Look out for....
Ian, September 07, 2007 - 4:32 am UTC
On 9i versions below 9.2.0.8 there is a nasty bug to look out for, whereby the spin count is multiplied by the number of CPUs. So on our 42-CPU system we were spinning 84,000 times instead of 2,000 times! Not good.
The workaround (once you have confirmed with Oracle that you are affected) is to set _spin_count to 2000 / number of CPUs.
See Note 4696143.8 on Metalink.
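To check the value on your own system, the usual sketch for reading a hidden parameter is below - it queries x$ tables, so it must be run as SYS, and x$ structures can vary by release:
-- current value of the hidden _spin_count parameter
-- (the x$ tables require a SYS connection)
select a.ksppinm parameter, b.ksppstvl value
  from x$ksppi a, x$ksppcv b
 where a.indx = b.indx
   and a.ksppinm = '_spin_count';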
measuring cpu cycles consumed due to spin
Nilanjan Ray, March 20, 2008 - 7:10 pm UTC
Hi Tom,
I suspect that our system is affected by too much LIO against an inflated buffer cache and an oversized keep pool. This results in too much spinning on the cache buffers chains latch. CPU utilization goes up and the run queue grows.
Since most of the time the latch is obtained by spinning, nothing is visible in the wait interface, but the high CPU utilization gives a wrong impression. The spin gets coincide well with the high CPU utilization and the run queue building up.
So far so good; I am convinced (maybe wrongly). The question I really want to ask is: is it possible to prove, with any Oracle stats, that the spinning is indeed the cause of the performance bottleneck? Any metrics, or any derivation from metrics, in Oracle 10g? This is crucial because a belief should be substantiated with proper data and proof.
Many thanks
Ray
clarification on sleep time
Sravan Tailam, March 25, 2009 - 10:15 am UTC
Tom, you mentioned:
------
it tried to get the latch, but failed (because someone else had it). So, it goes to sleep (waits on a latch free) and wakes up and tries again. The time it was asleep - that is the wait time for "latch free"
------
Can you please confirm whether the time shown in statspack for a latch free wait is sleeping time, spinning time, or both?
In other words, if my Top 5 events look something like:
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                      % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
CPU time                                                        14,073    92.03
latch free                                        130,315          401     2.62
enqueue                                             5,085          275     1.80
db file scattered read                            815,104          212     1.39
db file sequential read                         1,168,259           73      .48
========
Is the 401 seconds the time spent sleeping only?
If that's the case, how do I calculate the time spent spinning? Is the spinning time rolled into the CPU time?
March 29, 2009 - 8:20 pm UTC
That is sleeping only. Go to the latch report in the same file and review the MISSES and SPINS.
You cannot calculate the time spent in spins (it is some part of the CPU time) - only that you spun.
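What you can compute from v$latch is a proportion, not a time - a minimal sketch showing, of the gets that missed, how many were resolved by spinning (pure CPU) versus having to sleep (the "latch free" wait):
-- SPIN_GETS = misses eventually satisfied while spinning;
-- SLEEPS counts sleep episodes for gets that had to sleep
select name, misses, spin_gets, sleeps,
       round(100 * spin_gets / misses, 1) pct_got_by_spinning
  from v$latch
 where misses > 0
 order by misses desc;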
Krzysztof Własiuk, March 30, 2009 - 3:11 am UTC
In the above Statspack report header, the number of latch waits is quite small in comparison to the db block usage, in my opinion. Are you sure there is nothing else which consumes CPU time? Some ugly query, maybe?
March 30, 2009 - 4:17 pm UTC
the number of latch waits is the TIP OF THE ICEBERG.
In order to get to a latch wait, you spun in a hard loop 2,000 times trying to get the latch. Here is a simplified "latch algorithm", it looks much like this:
Loop
    for I in 1 .. 2000
    loop
        try to get latch
        if got latch, return
        if I = 1 then misses = misses + 1
    end loop
    INCREMENT WAIT COUNT
    sleep
    Add WAIT TIME
End loop;
so, every miss = CPU time. If and only if you miss it 2,000 times in a row will you wait. So, ~130,000 times you waited - that means ~130,000 times you spun 2,000 times burning up CPU (on the order of 260 million spin iterations). The MISSES in the latch report will tell you how many times altogether (probably in the millions) you spun - but spun fewer than 2,000 times.
Did I ever say "nothing else consumes cpu"?? I didn't say that at all.
I said "go to the latch report, review the number of MISSES and understand they all eat CPU"
of course other things are consuming CPU. That is a given.
the question was
If that's the case, how do I calculate the time spent spinning? Is the spinning time rolled into the CPU time?
Latch Free on enqueue
Upendra Satluri, May 04, 2009 - 10:39 pm UTC
Tom,
I don't know if this is a new question or relevant to the thread, but I see a lot of "latch free" waits in our system, and the P2 is enqueue. I kind of get stuck there. What do I check next in case it is an enqueue? I checked my statspack for the "sleepers", and this is what I find:
                                                      NoWait              Waiter
Latch Name               Where                        Misses     Sleeps   Sleeps
------------------------ -------------------------- ------- ---------- --------
enqueues                 ksqdel                            0  2,267,369 ########
enqueues                 ksqgel: create enqueue            0  2,254,960 ########
I can't make anything out of this, but these two seem to be causing the bulk of the sleeps. Can you throw some light on this, or should this have gone to a different forum?
May 09, 2009 - 10:37 am UTC
enqueues are locks - row locks typically.
do you see large enqueue waits as well?
any unindexed foreign keys out there throwing full table locks "by accident"?
A number of sleeps is, by itself, "not useful". Time periods (the duration of the report) and information on activity are needed in addition to the number.
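A simplified sketch of hunting for those unindexed foreign keys via the dictionary views - it only checks the leading column of each key, so treat the output as candidates to review rather than a definitive list:
-- foreign keys whose leading column has no index starting
-- on the same column (single-column heuristic)
select c.table_name, c.constraint_name, cc.column_name
  from user_constraints c, user_cons_columns cc
 where c.constraint_type = 'R'
   and cc.constraint_name = c.constraint_name
   and cc.position = 1
   and not exists
       ( select null
           from user_ind_columns ic
          where ic.table_name = cc.table_name
            and ic.column_name = cc.column_name
            and ic.column_position = 1 );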
Atomicity of "test and set"
Prajjwal Mallik, May 11, 2009 - 12:10 pm UTC
Hi Tom,
What a wonderful bit of information!
However, I have a silly question regarding this statement: "Oracle uses atomic instructions like 'test and set' and 'compare and swap' for operating on latches".
As per my understanding, when Oracle locks a set of rows (say) from one session, the lock information is stored in the data blocks (in the header), tagged with the transaction ID. Now suppose two sessions are simultaneously requesting a lock on an overlapping set of rows; the situation may be like this:
Step 1) session 1 >>> "test" row #5 (say) to see if it is locked by some other transaction. (At this moment, row #5 is found not locked. Also, session 1 does not lock it now.)
Step 2) session 2 >>> "test" row #5 to see if it is locked by some other transaction. (But at this moment, row #5 is not yet locked by session 1.)
Step 3) session 1 >>> "set" row #5 as locked by the transaction in session 1.
Step 4) session 2 >>> "set" row #5 as locked by the transaction in session 2. (It does not know session 1 has already locked row #5. It only relies on the information from step 2.)
My question is: with what construct does Oracle ensure the serialization of the flow of steps to be
Step 1, then Step 3, then Step 2, then Step 4
or
Step 2, then Step 4, then Step 1, then Step 3?
I believe "test and set" as a whole has to be atomic. If that is the answer you give me, I have yet another question. :)
Digging down to the lowest level, test and set can never be executed in a single CPU instruction, if I am not wrong. So the big question is: how is "test and set" atomic as a whole?
May 11, 2009 - 7:03 pm UTC
we do not use latches to lock rows. re-read the quote above about what latches are and what they are used for.
In order to access a row to modify it, you get the block the row is on in CURRENT MODE. Only one session can get a block in CURRENT MODE - it uses a latch to serialize access to the entire block, looks at the row it wants (it is the ONLY thing that can look at this row in CURRENT MODE), and if the row is not already locked - it locks it (by updating the transaction information in the block header). If it is locked, it notes that. Either way, it immediately unlatches the block (someone else can now look at it in current mode). If it locked the row - it continues on. If it could not lock the row because someone else had it - it creates an ENQUEUE LOCK (heavy duty) waiting on the session that has it locked; when that session commits, it gets a chance to get at the row again.
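A minimal sketch of watching those enqueue waits from the outside, using the standard v$lock view (BLOCK = 1 marks a session holding a lock that others are queued behind):
-- who is holding a lock that someone else is enqueued on
select holder.sid  holder_sid,
       waiter.sid  waiter_sid,
       waiter.type lock_type
  from v$lock holder, v$lock waiter
 where holder.block = 1
   and waiter.request > 0
   and holder.type = waiter.type
   and holder.id1 = waiter.id1
   and holder.id2 = waiter.id2;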
... Digging down to the lowest level, test and set can never be executed in a
single CPU instruction if I am not wrong. ...
Sure it can - that is what a test-and-set is: a CPU instruction. The CPU provides it for us (most do; HP's do not, for example :( - there we have to use heavier devices to serialize).
there's a specific Oracle feature
A reader, May 11, 2009 - 12:16 pm UTC
try using
dbms_lock.sleep( 5 );
Follow-up
A reader, May 11, 2009 - 12:33 pm UTC
The actual wait time for the "latch wait" is huge, and that was what got me interested. There seem to be a lot of conversions/releases and requests. The system may just be too busy. I tried looking at the individual enqueue waits, but none were alarming. One of my colleagues was thinking this may be because of a remote transaction the process does, and how that affects the enqueues for the local transactions.
Here are the Top 5 from the AWR:
                                                                 Avg %Total
                                                                wait   Call
Event                                 Waits     Time (s)        (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
db file sequential read          42,005,780      232,838      6   73.8 User I/O
CPU time                                          124,672          39.5
latch free                       11,592,952       76,746      7   24.3 Other
wait list latch free                107,553        2,088     19    0.7 Other
latch: library cache              1,135,976        1,862      2    0.6 Concurrenc
-------------------------------------------------------------
Here are the sleep numbers from the AWR report:
Latch Name
----------------------------------------
  Get Requests      Misses      Sleeps  Spin Gets   Sleep1   Sleep2   Sleep3
-------------- ----------- ----------- ---------- -------- -------- --------
enqueues
   164,661,022  67,359,830  11,487,083 ##########        0        0        0
library cache
   390,477,596  14,128,161   1,136,176 ##########        0        0        0
library cache pin
   260,126,896   6,223,864     831,838  5,398,122        0        0        0
cache buffers chains
 1,262,613,116   2,602,885     429,898  2,175,706        0        0        0
transaction branch allocation
    82,282,012   1,600,047      95,477  1,505,151        0        0        0
Would altering the spin count help with the LATCH FREE waits? It would seem logical, because then I would spin longer and might not have to sleep, so I would have fewer latch free waits.
Does it make sense to close the DB link after doing a remote transaction, so the local transactions are treated differently and not so many enqueues are needed?
Thanks for the reply, Tom.
May 11, 2009 - 7:08 pm UTC
... The actual wait time for the "latch wait" is huge ..
a) I had no way to see or verify that
b) no, it is not. not relative to IO waits anyway.
... Would altering the spin_count help with the LATCH FREE waits? ....
DO NOT EVEN CONSIDER THAT. Unless support tells you to - do not touch it, you can so completely destroy things with that.
Sure, you'll spin longer - but what if your degree of concurrency is such that the latch holder is getting kicked off the CPU (it ran out of its time slice)? Do you REALLY want the latch requesters to *spin longer*, preventing the latch holder from getting onto the CPU again?!
... Does it make sense to close the DB Link after doing a remote transaction ...
no, if you don't access the link, it won't be a distributed transaction.
any unindexed foreign keys out there throwing full table locks "by accident"?
Follow-up - Latch Free
Upendra Satluri, May 11, 2009 - 12:37 pm UTC
The previous post was from me. I guess I didn't fill in the details, in my hurry.
Thanks
Prajjwal Mallik, May 12, 2009 - 11:01 am UTC
Thanks a lot Tom for the clarifications. :)
Enqueue waits
Upendra Satluri, May 13, 2009 - 11:04 am UTC
Thanks Tom for making things totally clear.
... Does it make sense to close the DB Link after doing a remote transaction ...
no, if you don't access the link, it won't be a distributed transaction.
What I meant was: since the DB link remains open for as long as the session remains open, does it make sense to close the link with the ALTER SESSION CLOSE DATABASE LINK command? Since there is a session open on the target, does Oracle treat all statements coming from the master as transactions (including select statements, maybe)? So, does it make sense to actually close the DB link with the alter statement, so as to reduce the number of enqueue requests? Does it actually matter at all?
And I wouldn't touch the spin count without getting approval from Oracle support anyway (I have a job to keep); I just wanted an idea of whether I should even think about such a thing. I got my answer.
I guess I am diverting from the main topic of latch free, which is what this thread was about, so I will not mind if you don't want to answer this, but I will surely appreciate it if you do.
Thanks again Tom. Very helpful, as always.
May 13, 2009 - 3:02 pm UTC
... does oracle treat all statements coming from the master as transactions ...
no, it doesn't
you'd want to close the link if
a) you were not using shared database links
b) you wanted to reduce the resources consumed ON THE REMOTE site - each session would be taking up a bit of resource over there.
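A minimal sketch (remote_db is a hypothetical link name; a link cannot be closed while a transaction is still using it):
commit;  -- end any in-flight distributed transaction first
alter session close database link remote_db;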
Thank you for the wonderful insight
Upendra Satluri, May 14, 2009 - 11:28 am UTC
Thanks Tom.
Got all my answers.
Also, maybe because of the limitation you mention with the HP-UX CPUs (where test and set is not a single CPU instruction) - maybe, just maybe, the enqueue latch is a solitary latch on HP (or is it a solitary latch everywhere, latch# 18)? Maybe this is how Oracle ensures serialization? Or am I totally confused?
May 14, 2009 - 12:09 pm UTC
No, it would not make us use a solitary latch (whether you have ONE or THOUSANDS, the logic of latching is "serialize the latch get" in general; just having one latch would not "fix" anything).
On hardware that doesn't have a test and set, we use something heavier, like a semaphore, which the OS provides for serialization.
Thanks again - Tom
Upendra Satluri, May 14, 2009 - 2:33 pm UTC
Thanks Tom. A good learning experience.
Will keep checking the system to see if I find something, but most likely it is just the way the application is designed. Management wants to run more than 200 application processes, and as soon as we increase the number to 150, I start seeing these "Latch Free" events and the total throughput starts dropping.
I was able to eliminate some of the "concurrency" waits by partitioning some indexes and increasing the pctfree for some of the index blocks, but I can't seem to handle these "other" waits, as AWR reports them.
But thanks Tom, very useful definitions and insights. At least now I know where NOT to look, which is important too.
Clarification
Claus, May 15, 2009 - 7:39 am UTC
As discussed, the implementation of latches depends on the CPU architecture. HP-UX runs on two CPU architectures, PA-RISC and Itanium. PA-RISC does not have an atomic test-and-set instruction; Itanium does.
May 15, 2009 - 1:47 pm UTC
correct
More on latches on 10.2.0.3
Sidney Pryor, April 23, 2010 - 11:45 am UTC
Hi Tom,
Reading your answers is always a great way to know Oracle better.
I read this question from the beginning and some other questions came to mind. We are starting to see some latch wait events on our 10.2.0.3 databases. So here they go:
1) As latches are semaphores controlling access to memory blocks, if we have waits for them, does it mean that we have few free memory blocks, that we have high concurrency for some specific hot blocks, OR that we have slow DML operations?
2) Is latch configuration an operating system (OS environment) issue or limitation?
3) What is the best way to find the source of latch waits?
4) Are latch issues associated with application logic too, and how can the database help in these cases?
5) If possible, how can pctfree help to solve or decrease concurrency? I read something about this too, but it was not very clear.
Thank you for your attention.
Sidney
April 23, 2010 - 2:17 pm UTC
1) Latches are locks - serialization devices - used to control concurrent access to in-memory data structures that are shared by many threads.
It is not "memory blocks", it is shared data structures - like the shared pool, for example.
There are latches that protect MANY structures; not all are buffer cache related - in fact, most are not.
You'd have to see what the latch is for - say it was for the shared pool - well, tuning your buffer cache wouldn't do anything for you.
2) I don't know what you mean.
3) v$ views, statspack/AWR reports, wait events in ASH or tkprof reports (see the sketch after this list).
4) latches are used in response to the client asking us to do something, so yes, they are associated with how the application uses the database.
Say the application parses 100 queries per second
Say further, the application does use any binds
You'll hard parse like mad, you'll have shared pool and library cache latching like crazy, and the only way to fix it would be to reduce the hard parsing - which is induced by the way the application was coded.
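A minimal sketch of the difference, in SQL*Plus against the standard SCOTT.EMP demo table:
-- literals: each distinct value is brand new SQL text,
-- so each statement is hard parsed
select ename from emp where empno = 7369;
select ename from emp where empno = 7499;
-- a bind variable: one shared statement, hard parsed once,
-- then simply re-executed with new values
variable empno number
exec :empno := 7369
select ename from emp where empno = :empno;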
Say the application forces the use of indexes inappropriately - like this:
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:6749454952894#6760861174154
That'll latch on the cache buffers chains (buffer cache) latch like crazy due to the excessive logical IOs. It has to do with the way the application is coded.
5) pctfree is used to avoid migrated rows - not concurrency so much. However, since pctfree limits the number of rows per block, it can have an indirect effect on concurrency. But probably not on latching issues - more like buffer busy wait events (too many people wanting to modify the same block at the same time; if you reduce the number of rows per block, it might be safe to assume that the number of concurrent accesses of that block would decrease as well).
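As mentioned in 3) above, a minimal sketch of using the v$ views to locate the source of latch waits (columns per the standard v$latch and v$system_event views):
-- system-wide: which latches have been slept on since startup
select name, gets, misses, sleeps
  from v$latch
 where sleeps > 0
 order by sleeps desc;
-- which latch-related wait events have cost the most time
select event, total_waits, time_waited
  from v$system_event
 where event like 'latch%'
 order by time_waited desc;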
Still about latches
Sidney Pryor, April 26, 2010 - 10:17 am UTC
Tom,
First I want to thank you for answering all my questions above.
1) Understood, thank you. So I can assume that latches control (serialize) access to a group of memory blocks that together form a shared data structure? For example, when a session waits for "latch: cache buffer", does this mean that it is waiting for one serialization device specifically that controls access to some specific data, or is it waiting for one free serialization device to access the buffer cache area?
2) On this question, I wanted to know whether there is some environment parameter that can be tuned at the OS level in order to reduce latch contention?
3) Ok, perfect, thank you;
4) Sorry, I am not sure what you really meant with "the application does use any binds". Did you mean "does not use any binds"?
5) Perfect clear, thank you very much.
Best regards,
Sidney
April 26, 2010 - 10:33 am UTC
1) shared data structures, do not think in "blocks", think arrays, linked lists, record structures - think pointers from C.
when waiting, for example, for the cache buffers chains (cbc) latch, it is waiting to access a list of database blocks that are in the cache - these blocks represent a subset of all the blocks in the cache.
2) not really - not an environment variable really. Latch contention happens because of too many concurrent requests for the same resource. The way to reduce that contention is to reduce the need for all of those concurrent requests.
EG: reduce hard parsing - that'll reduce the shared pool latch requests.
Tune a query that is using indexes inappropriately - that'll reduce the cbc latch requests.
4) yes - that "does" should have been "doesn't"
Latch Sleep Breakdown
Arvind Mishra, August 01, 2011 - 12:01 am UTC
Hi Tom,
1) In your book Expert Oracle Database Architecture, in Chapter 6, Locking and Latching, page 225, while discussing the latch sleep breakdown you wrote that "...to perform two units of work we needed to use 3 units of CPU". I am not sure how you reached that conclusion. The data shown on the page is:
Latch Name      Requests    Misses  Sleeps  Sleeps 1->3+
-------------- ---------- -------- ------- --------------
shared pool     1,126,006  229,537     406 229135/398/4/0
library cache   1,108,039   43,582       7 45575/7/0/0
2) On page 226 of the same book you write: "In our tests, because we have a single-user example, we conclude that about 22 CPU seconds was spent spinning on the latch." But the elapsed time for a single user is 0.52 and for two users is 0.78; therefore 78 - 52 = 26. Is this a misprint, or am I comparing the wrong data?
3) What is the meaning of the data in the column Sleeps 1->3+?
4) Where can I find full details about the data contained in a statspack report?
5) What are the differences between a statspack report and the new AWR report?
Thanks,
Arvind Mishra
August 01, 2011 - 11:54 am UTC
1) Look at the CPU time used at the aggregate level. When two users ran, the total database cpu time was tripled. You cannot look at the latch report for that - look at the instance activity.
We used 26 CPU seconds for the single user, we used 74 CPU seconds for two users. Tripled cpu. Doubled work.
2) elapsed time is not meaningful for the numbers we are discussing - we are looking at CPU TIME.
3) how many times someone slept ONCE for the latch, then twice, then three times, then more than three times.
229135/398/4/0 => 229,135 times we slept once, 398 times twice, 4 times thrice, and 0 times more than three times.
4) the reference guide details many of the statistics. Most are obvious - the less obvious ones are documented in the Reference guide available on otn.oracle.com and/or the Performance guide available on the same site.
5) statspack is sort of frozen in time and will not necessarily be enhanced. AWR and statspack have a large amount of overlap, but they will diverge over time as more and more is added to AWR.
Latch Issue in Oracle 10g
Santosh, March 15, 2013 - 5:31 am UTC
Hi Tom,
We are having a latch issue in one of our 10g live databases.
There is a batch job which selects details from v$latch, but the select on v$latch hangs and consumes 100% CPU.
We checked the AWR report, but no latch event is reported in the top 5 events.
Can you please help?
Thanks,
Santosh.
March 25, 2013 - 9:40 am UTC
please utilize support for something like this.
and question why a batch job is selecting from v$latch - an in-memory data structure which itself has to be protected by latches to be accessed. You do know that frequently accessing the v$ tables is itself going to be a performance issue? I cannot imagine why a batch job would be selecting from this.
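If the batch job's goal is just historical latch statistics, a less intrusive sketch is to read what AWR has already captured (the DBA_HIST views require a Diagnostics Pack license) rather than polling v$latch:
-- latch statistics as of each AWR snapshot, instead of
-- hitting the live v$latch structure from a batch job
select s.begin_interval_time, l.latch_name,
       l.gets, l.misses, l.sleeps
  from dba_hist_latch l, dba_hist_snapshot s
 where l.snap_id = s.snap_id
   and l.dbid = s.dbid
   and l.instance_number = s.instance_number
   and l.sleeps > 0
 order by s.begin_interval_time, l.sleeps desc;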