When consistent gets go up but costs go down?
Question and Answer

Tom Kyte

Thanks for the question, josh.

Asked: February 14, 2011 - 9:58 am UTC

Last updated: February 22, 2011 - 9:53 pm UTC

Version: 10.2.0

Viewed 1000+ times

You Asked

I have a piece of SQL I am trying to tune. I run it through autotrace in SQL*Plus and get 82k consistent gets. I make my change to be more efficient (I think) and get 140k consistent gets (all other values stay the same). A simple explain plan shows lower costs for my more efficient version, and running it through the SQL advisor also shows lower cost figures for my more efficient version. So I have greater consistent gets but a lower cost?



and Tom said...

Because cost is a number computed by the optimizer based on estimates it makes using imperfect information.

You might want to read through this entire post on the subject:

https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:313416745628


If you put the two plans together - does the optimizer come up with exactly the same estimated cardinalities? (I doubt it.) Given that the cardinality estimates are the single biggest contributor to cost - unless it gets the exact same estimate for both - it is unlikely to produce costs that are even remotely comparable.

I would suspect that in your 2nd plan, the optimizer underestimated the cardinalities at some point. You can verify this via:


run the query with this 'hint':

ops$tkyte%ORA11GR2> select /*+ gather_plan_statistics */ *
  2    from t where flag1='N' and flag2='N';

no rows selected

and then:

ops$tkyte%ORA11GR2> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------------------------------------
SQL_ID  3qazrn9xhg3zt, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ *   from t where flag1='N' and
flag2='N'

Plan hash value: 1601196873

-------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |
-------------------------------------------------------------  ...
|   0 | SELECT STATEMENT  |      |      1 |        |      0 |
|*  1 |  TABLE ACCESS FULL| T    |      1 |  18106 |      0 | ...
-------------------------------------------------------------



It shows the e-rows (estimated rows) versus a-rows (actual rows). As you can see in this example - the estimate was WAY above the actual - meaning the optimizer didn't guess the right cardinality. Something you can "fix" in various ways.

For example, the optimizer had to guess when evaluating the predicate 'flag1=N and flag2=N' - because it would have statistics on flag1 and statistics on flag2 - but NOT on flag1 and flag2 at the same time. We could fix that via:

a) extended statistics in 11g
b) dynamic sampling in 9i and above
c) sql profiles in 10g
d) the cardinality hint
e) perhaps a function based index
....

for example:

ops$tkyte%ORA11GR2> select /*+ dynamic_sampling(t 3) */ * from t where flag1='N' and flag2='N';

Execution Plan
----------------------------------------------------------
Plan hash value: 470836197

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |     6 |   606 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T     |     6 |   606 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T_IDX |     6 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------
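Option (a), extended statistics, would look something like the following in 11g (a sketch using the T / flag1 / flag2 example above; the extension name the function returns is system-generated):

```sql
-- create a column group so the optimizer gets combined statistics
-- on (flag1, flag2) instead of assuming the columns are independent
select dbms_stats.create_extended_stats( user, 'T', '(FLAG1,FLAG2)' ) from dual;

-- re-gather statistics so the new column group is populated
exec dbms_stats.gather_table_stats( user, 'T' );
```

With the column group in place, the optimizer can estimate the combined selectivity of "flag1='N' and flag2='N'" directly instead of multiplying the two individual selectivities.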


see also

http://www.oracle.com/technetwork/issue-archive/2009/09-jan/o19asktom-086775.html

Comments

josh p, February 16, 2011 - 3:25 am UTC

You have given me some reading to do there... I've skimmed it, will need to go back over it.

If I can perhaps simplify my question: I ran the two statements through a trace and tkprof'd the results. I ran each statement a few times to get an average.

The statement that runs twice as fast does twice the number of consistent gets (optimizer is set to all_rows, no hints in the query). So which is better (i.e. less load on the machine): the quicker one, or the one with fewer consistent gets? I always read you should aim for lower LIO, as this means less work and a faster result... but lower LIO is slower in my case?
Tom Kyte
February 16, 2011 - 11:00 am UTC

show us the work - show us the tkprof - so we can evaluate the numbers.



each logical IO you do, in general, requires a cache buffers chains latch. Each latch requires a certain amount of CPU and the more users that request the same latch - the more CPU you require. That is - the more concurrency there is - the higher the contention for a given latch - the more CPU you will need to execute a given query.

So, while a query might do more logical IO and run faster than a query that does less logical IO IN ISOLATION, in single user mode - you may well find it takes longer in real life, when multiple users execute it.


for example, here is a query run in single user mode, compared to the same query run with 8 users (contention)

select /*+ first_rows */ id, object_name
from
 big_table order by id


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch    66668      1.07       1.10          0     148830          0     1000000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    66670      1.08       1.10          0     148830          0     1000000

********************************************************************************
select /*+ first_rows */ id, object_name
from
 big_table order by id


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch    66668      3.79       3.76          0     148830          0     1000000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    66670      3.80       3.77          0     148830          0     1000000



see the CPU increase? that was spinning on the cache buffers chains latch.

CPU increase due to cache buffer chain latching?

A reader, February 16, 2011 - 4:38 pm UTC

Tom- can you please clarify the difference between the statement you used "that was spinning on the cache buffer chains latch" vs a session waiting on a cache buffer chain latch?

Based on my experience with cache buffer chain latches I would have expected the 2nd test to show CPU time of something like 1.08 seconds with the elapsed time of 3.77 seconds with wait time on cache buffer chain latch of 2.69 seconds- that the contention on this latch ends up with the session waiting on it, not consuming more CPU due to it.

Could you explain the difference here? Thanks
Tom Kyte
February 16, 2011 - 5:41 pm UTC

we SPIN on latches - we do not wait on a latch until we've SPUN 2000 times. The latch get algorithm looks a bit like this:

loop
   for i in 1 .. decode( cpu_count, 1, 1, 2000 )
   loop
     try to get latch
     if gotten return it
     if (i=1) then increment misses on this latch end if
   end loop
   increment wait count on this latch
   sleep a bit
end loop;
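That loop can be sketched in executable form - a toy Python model, NOT Oracle's actual code - using threading.Lock as a stand-in for the latch; the counters loosely mirror the misses / sleeps / spin gets statistics in a latch activity report:

```python
import threading
import time

def get_latch(latch: threading.Lock, cpu_count: int = 8,
              spin_limit: int = 2000, sleep_time: float = 0.01) -> dict:
    """Toy model of the spin-then-sleep latch get loop described above."""
    stats = {"misses": 0, "sleeps": 0, "spin_gets": 0}
    while True:
        # on a single-CPU box spinning is pointless: the holder cannot
        # run (and release the latch) while we occupy the only CPU
        limit = 1 if cpu_count == 1 else spin_limit
        for i in range(1, limit + 1):
            if latch.acquire(blocking=False):   # "try to get latch"
                if i > 1:
                    stats["spin_gets"] += 1     # gotten while spinning
                return stats
            if i == 1:
                stats["misses"] += 1            # first failed try counts as a miss
        stats["sleeps"] += 1                    # spun spin_limit times; give up
        time.sleep(sleep_time)                  # "sleep a bit", then start over
```

An uncontended call returns immediately with all counters at zero; a contended one burns CPU in the inner loop (the spinning Tom describes) before it ever records a sleep.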


So, if we are "alone" in wanting the latch - instantaneous - we get it.

If we are not alone and someone else has it - we spin on it (consume cpu). The reason we do not go to sleep (sleeping takes a LONG time - seconds) is that we are on the cpu, getting on the cpu is EXPENSIVE and we don't want to get switched off the cpu - so we spin and try to get the latch. Latches are by definition serialization devices that we hold for a very very very very short period of time - so we are expecting the latch to be given up very very very soon (another reason to NOT vacate the cpu).

as you increase contention for latching - you increase the cpu you are going to use up (as many people spin when contending for the same latch). you decrease scalability (because you have a serialization device in the mix AND you are using cpu for spinning- not working)


search google for spin locks.


tkprofs..

josh p, February 17, 2011 - 3:32 am UTC

OK, here are the tkprofs. As you can see, the change (to make it more efficient?!) resulted in an index being used on a 4.6 mil row table... the speed of the query doubled, but the consistent gets also doubled. I feel I'm missing some fundamentals here... (to confirm: all table stats are current and accurate, and no hints were used)

NO INDEX on large table..
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        3      0.00       0.00          0          0          0           0
Execute      3      0.00       0.00          0          0          0           0
Fetch        3      3.84       3.75          0     244071          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        9      3.84       3.75          0     244071          0           0

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 861 (MARINER)

   Rows Row Source Operation
------- ---------------------------------------------------
      0 HASH JOIN RIGHT OUTER (cr=81357 pr=0 pw=0 time=1251038 us)
   5647  INDEX FAST FULL SCAN ACC_TERMS_AGENCY_IDX_AGTERMS (cr=23 pr=0 pw=0 time=32 us)(object id 64972)
      0  FILTER (cr=81334 pr=0 pw=0 time=1249100 us)
     28   HASH JOIN RIGHT OUTER (cr=81334 pr=0 pw=0 time=1305528 us)
  18358    TABLE ACCESS FULL AGENCY (cr=404 pr=0 pw=0 time=24 us)
     28    HASH JOIN (cr=80930 pr=0 pw=0 time=1294892 us)
 213426     TABLE ACCESS FULL RES_HEADER (cr=76947 pr=0 pw=0 time=12391940 us)
   4640     TABLE ACCESS FULL RES_CHARGE_DETAIL (cr=3983 pr=0 pw=0 time=60755 us)

INDEX on large table..
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        3      0.00       0.00          0          4          0           0
Execute      3      0.00       0.00          0          0          0           0
Fetch        3      1.85       1.81          0     417984          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        9      1.86       1.82          0     417988          0           0

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 861 (MARINER)

   Rows Row Source Operation
------- ---------------------------------------------------
      0 HASH JOIN RIGHT OUTER (cr=139328 pr=0 pw=0 time=609277 us)
   5647  INDEX FAST FULL SCAN ACC_TERMS_AGENCY_IDX_AGTERMS (cr=23 pr=0 pw=0 time=19 us)(object id 64972)
      0  FILTER (cr=139305 pr=0 pw=0 time=606847 us)
     28   HASH JOIN RIGHT OUTER (cr=139305 pr=0 pw=0 time=662682 us)
  18358    TABLE ACCESS FULL AGENCY (cr=404 pr=0 pw=0 time=23 us)
     28    HASH JOIN (cr=138901 pr=0 pw=0 time=651448 us)
 213426     TABLE ACCESS BY INDEX ROWID RES_HEADER (cr=134918 pr=0 pw=0 time=426893 us)
 238553      INDEX SKIP SCAN RES_HEADER_FK_SAILFR (cr=1015 pr=0 pw=0 time=330 us)(object id 66605)
   4640     TABLE ACCESS FULL RES_CHARGE_DETAIL (cr=3983 pr=0 pw=0 time=60780 us)

It is being run on a test system, so very little contention with other processes..
Tom Kyte
February 17, 2011 - 8:33 am UTC

here is my educated guess based on what we have to look at here.

Suppose you get about 100 rows per block for RES_HEADER. That would mean there are about 7.6 million rows in the table (let's assume it is between 3 million and 7.6 million...)

The full scan of res_header must be evaluating a predicate of the form "where low_distinct_cardinality_column = ? and some_other_column <op> ?" - I say low_distinct_cardinality because of the skip scan in the second plan. I think there is likely some other column involved because of the skip scan and the fact that we get more rows out of the index than we get out of the table (indicating there is some additional filtering going on that the index cannot do by itself) - and maybe more. I don't know the predicate operator on the second column, let's assume it is equals.

So, we are evaluating "where low_distinct = ? and some_other_column = ?" between 3 and 7.6 million times.

Now, skip to the index version. We get 238,553 rows out of the index and then get 213,426 rows out of the table. We had to evaluate that where clause 238,553 times - at LEAST an order of magnitude fewer times than we did for the full scan.
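As a quick back-of-the-envelope check, using the cr= and row counts from the tkprofs above (the 100 rows per block figure is the assumption, not a measured value):

```python
# figures from the tkprof row-source lines above
full_scan_blocks = 76947   # cr= on the TABLE ACCESS FULL RES_HEADER line
rows_per_block = 100       # assumed average rows per block
skip_scan_rows = 238553    # rows returned by the INDEX SKIP SCAN

# the full scan evaluates the predicate for every row it visits;
# the index path only for the rows the skip scan returns
full_scan_evals = full_scan_blocks * rows_per_block
ratio = full_scan_evals / skip_scan_rows

print(f"full scan: ~{full_scan_evals:,} predicate evaluations")
print(f"index path: {skip_scan_rows:,} evaluations (~{ratio:.0f}x fewer)")
```

Under that assumption the index plan evaluates the where clause roughly 30x less often, which is where its CPU saving comes from.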

Since you ran single user mode, the cost of a consistent get is very "cheap", there is no contention. The cost of evaluating the where clause is constant - it takes the same amount of CPU regardless of the number of users - because it is "contention free"


Therefore, if you are on a single user system - the optimizer did the right thing by costing out CPU and IO (that is what the cost is - a function of IO and CPU costing). But - if you run lots of users at the same time - the true cost of a logical IO will go up and up and up as you experience contention for cache buffers chains latches (as demonstrated above)

Did that all make sense?

Follow up question to cache buffer chain latching reply

A reader, February 17, 2011 - 9:40 am UTC

Thanks for the details on the cache buffer chain latching logic. I know in the past I've looked at trace files where CPU/Elapsed time was much higher at 1 time vs another time yet logical/physical reads were similar so this could explain it- in the past I have not been able to explain why that would happen.

Based on your test with the cache buffer chain contention I assume the wait events for that query wouldn't show much (if any) time spent actually waiting on cache buffer chain latching- the CPU time is higher than elapsed time so there isn't any time waiting on anything.

Assuming a trace didn't show the cache buffer chain latching how would I know the higher CPU time in that trace was in fact due to extra spinning in the cache buffer chain latching? In other words- if you got those 2 trace files and didn't know the 2nd test was ran under concurrency what statistics would show there was extra CPU time due to spinning on the cache buffer chain latch?

Are there other scenarios where the same query does the same logical/physical reads yet one consumes more CPU than the other that are not due to cache buffer chain latching?

This is very interesting. Thanks in advance.
Tom Kyte
February 17, 2011 - 11:51 am UTC

Based on your test with the cache buffer chain contention I assume the wait
events for that query wouldn't show much (if any) time spent actually waiting
on cache buffer chain latching- the CPU time is higher than elapsed time so
there isn't any time waiting on anything.


correct - unless the contention was so severe that I spun more than 2000 times - there would be NO waits - just misses on the latch.

Assuming a trace didn't show the cache buffer chain latching how would I know
the higher CPU time in that trace was in fact due to extra spinning in the
cache buffer chain latching?


you wouldn't see it in a trace, you would only see it at the "system" level and it would be visible via a high number of "latch misses"

                                       Get                                 Spin
Latch Name                        Requests       Misses      Sleeps        Gets
-------------------------- --------------- ------------ ----------- -----------
cache buffers chains             2,474,493        3,313          79       3,237
shared pool                         39,878           67          20          54
row cache objects                   66,057           24           3          21
          -------------------------------------------------------------


you'd see something like that in your statspack/awr reports (that is from my example above). As spin gets go up, so goes your CPU


Are there other scenarios where the same query does the same logical/physical
reads yet one consumes more CPU than the other that are not due to cache buffer
chain latching?


sure, you might take a query like:

hash join
   full scan t1
   full scan t2

and turn it into 

nested loop
   table access by index rowid
      index range scan
   table access by index rowid
      index range scan


and find the cpu to be less in #2 even if the IO's are more (particularly in single user mode). The reason:

o hash join is more cpu intensive than nested loop in general. We have to hash all of the values we get from T1 into memory/temp and then do the hash lookup from t2 into t1. Hashing can be a cpu intensive sort of operation

o we probably evaluate some predicate more often in case #1, just like your example above


There are other examples too - not all logical IO's are 'equal', some are cheaper than others - there are reads from undo for read consistency - they can be "cheaper" than a read from a table. There are new "fastpath" consistent gets. You might have to process delayed block cleanouts for one query and not for another (by hitting different sets of blocks - or by hitting the same block over and over and over again instead of hitting all of the blocks in the table).


josh p, February 21, 2011 - 7:55 am UTC

So... not all consistent gets are equal... see below... double the cr, but only a fraction of the time. This is why the tkprofs I posted show the FTS to be slow but have fewer consistent gets...

fts (cr=76947 pr=0 pw=0 time=12391940 us)

table access by index (cr=134918 pr=0 pw=0 time=426893 us)

so if I can't use cost, can't use consistent gets... (I also see other examples where the optimizer chooses an index skip scan over an index range scan even though the range scan can reduce the consistent gets by half... should I hint it?!)

Also, it doesn't matter whether I run this particular example on a production (many users/processes) system or a test system with very few users... the optimiser still favours the plan with the greater consistent gets.

Thanks
Tom Kyte
February 22, 2011 - 9:53 pm UTC

I'm not sure what you are trying to say here - as we see only a tiny bit of information - a snippet of a bigger thing - so I guess... whatever?
