
Question and Answer

Tom Kyte

Thanks for the question, Steve.

Asked: July 13, 2006 - 10:34 am UTC

Last updated: November 14, 2012 - 10:14 am UTC

Version: 9.2.7

Viewed 1000+ times

You Asked

Tom,

I'm a tuner for approximately 100 Oracle databases on IBM AIX at a large institution. Most of the databases are primarily read-only. Some experience a lot of activity, others are less frequently used.

When I recently came on board here, I noticed that the DBAs here were not using CPU costing with their Oracle9i databases. So, I collected system stats during a representative workload period using:

begin
    DBMS_STATS.GATHER_SYSTEM_STATS(
        gathering_mode => 'INTERVAL',
        interval       => 90 );
end;
/

When I looked at the stats that were collected, I was surprised to see that with most of the databases, mreadtim < sreadtim. For example, here are the results of the following query for four databases on the same shared host:

select pname, pval1
from sys.aux_stats$
where sname = 'SYSSTATS_MAIN'

db1

CPUSPEED 609
MAXTHR 9189376
MBRC 4
MREADTIM 0.073
SLAVETHR -1
SREADTIM 0.081

db2

CPUSPEED 610
MAXTHR 15090688
MBRC 7
MREADTIM 1.967
SLAVETHR -1
SREADTIM 4.103

db3

CPUSPEED 610
MAXTHR 350120960
MBRC 12
MREADTIM 1.103
SLAVETHR 9216
SREADTIM 2.372

db4

CPUSPEED 608
MAXTHR 32674816
MBRC 14
MREADTIM 0.067
SLAVETHR -1
SREADTIM 6.145

The situation where mreadtim is less than sreadtim is not documented in the Oracle Performance Tuning Guide, nor is it covered in Jonathan Lewis' book. However, I have been able to Google it and read about it in several forums. I wasn't able to get any hits at asktom.oracle.com.

http://oraclesponge.blogspot.com/2005/12/playing-with-system-statistics.html
http://www.orafaq.com/usenet/comp.databases.oracle.server/2005/09/14/0925.htm

The two references above include discussions from Jonathan Lewis and a few other notable Oracle users.

Anyway, mreadtim < sreadtim seems to be common when the host's storage is a SAN, which is my case as well. The problem, as I understand it, is that Oracle9i won't use CPU costing when mreadtim is less than sreadtim. So, the suggestion I've read is to manually adjust mreadtim to something just above sreadtim.
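
For reference, the adjustment being suggested would be made with DBMS_STATS; a minimal sketch, where the gathered SREADTIM is read back and MREADTIM is pushed 10% above it (the 1.1 factor is an arbitrary illustration, not a recommendation):

declare
    l_status varchar2(20);
    l_dstart date;
    l_dstop  date;
    l_sread  number;
begin
    dbms_stats.get_system_stats( l_status, l_dstart, l_dstop,
                                 'SREADTIM', l_sread );
    dbms_stats.set_system_stats( 'MREADTIM', l_sread * 1.1 );
end;
/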

Questions:

1. Do you concur with this advice across the board, to tweak the mreadtim statistic to just above sreadtim? If so, please explain the rationale. If not, please provide me with some guidance on how to determine what an appropriate setting for mreadtim would be. My understanding is that we are trying to provide Oracle with these statistics so that it can better determine whether it is cheaper to use indexes or full table scans for a given query. "Tweaking" the mreadtim statistic to something artificial doesn't seem to make much sense. What if mreadtim is truly less than sreadtim - I'd be losing all the benefit of that big SAN read-ahead cache that I have.

2. If I were to tweak the mreadtim statistic as suggested in 1), the primary benefit would be that Oracle9i would calculate execution plan costs considering the MBRC statistic rather than the db_file_multiblock_read_count parameter, right?

3. An interesting side note. All of my databases are on the same SAN. I've noticed that not all databases exhibit mreadtim < sreadtim. It appears that with the data warehouses that service very demanding queries, mreadtim > sreadtim. For example, here's the output from one such database:

CPUSPEED 608
MAXTHR 23887872
MBRC 9
MREADTIM 17.632
SLAVETHR -1
SREADTIM 13.373

Please correct me if I'm wrong, but my take on this is that the mreadtim statistic gets higher when there are queries doing table scans of very large tables, because the SAN's read-ahead cache is only so big and the disk I/O eventually makes the mreadtim statistic higher than sreadtim.

Thanks,
Steve


and Tom said...

I decided to outsource this one :) I asked Jonathan Lewis to comment and he wrote:

Tom,

A few comments:

Strategically I always try to tell the truth to the optimizer, and work around any (hopefully short-term) problems. The case of system statistics, however, is particularly tricky, because the thing that Oracle can measure may not be directly correlated to the action that is causing a problem.

The problem of mreadtim being less than sreadtim arises from the way that SANs tend to be very good (and insanely enthusiastic) about using read-ahead. Consequently a SAN can be very good at supplying data for multi-block reads, with the side-effect that its response time for highly random single-block reads degrades.

Ideally, mreadtim and sreadtim should be measuring the actual time off disc for data - on the basis that the available disc usage time is the resource that is limited. Unfortunately, what Oracle is measuring is time through the SAN cache - and since the SAN does not treat small and large reads equally, Oracle gets the wrong idea of what resources are available and how best to use them.

What makes it particularly difficult is that the 'degree of unfairness' is dependent on the nature of your application, and is impacted by the activity of other systems sharing the same spindles.

I am still unsure of the best way to deal with the problem, but one thought is to use the gathered statistics to give Oracle a value for MBRC, and then use an external calibration tool to time the hardware on randomly scattered reads of the single and multiblock read sizes. Alternatively, use the manufacturer's spec sheets to calculate the notional sreadtim and mreadtim - based on the MBRC, I/O seek time, and data transfer rates (which is roughly what 10g does if you don't do a gather).


Regards

Jonathan Lewis
http://www.oracle.com/technology/community/oracle_ace/ace1.html#lewis

The Co-operative Oracle Users' FAQ
http://www.jlcomp.demon.co.uk/faq/ind_faq.html

Cost Based Oracle: Fundamentals
http://www.jlcomp.demon.co.uk/cbo_book/ind_book.html



Comments

mreadtim > sreadtim benefits

Connor, July 17, 2006 - 10:24 am UTC

Just to add some clarification to my usenet post.

One reason it's useful to keep cpu costing enabled (by tweaking mreadtim) is that you get that extra control over multiblock reads. For example, you can have MBRC fixed at the statistically measured value (say 12), but set db_file_multiblock_read_count to your platform's max (e.g. 128).

So you *cost* queries sensibly, but then still get the most out of your platform's IO for multiblock reads. In fact, I somewhat stumbled upon this during support for a Sun problem where different system calls (pread was one, can't remember the other) were used when multiblock reads > 32. In that particular case, setting MBRC to 16 and db_file_multiblock_read_count to 128 gave large benefits due to the different system call being used.
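
Expressed as commands, that combination would look roughly like this (a sketch; 12 and 128 are just the example values above):

exec dbms_stats.set_system_stats( 'MBRC', 12 )
alter system set db_file_multiblock_read_count = 128;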

hth
Connor

mreadtim < sreadtim

Jason B., August 12, 2008 - 1:01 am UTC

I don't entirely buy Jonathan Lewis's explanation here. (Or maybe I simply don't understand it.)

If a SAN is giving priority to large reads, I can understand the per-block average time being better for multiblock reads than for single block reads. (Indeed, that's the whole point of multiblock reads!)

But the system statistic mreadtim is the transfer time for the entire transfer, not for each block! I don't see how it could be that much of an extra burden to read a chunk of blocks and then throw most of it away, if it came to that.

If scheduling, seeking and cache eviction to handle new requests hurt single-block reads, wouldn't they also hurt multiblock reads?

So I think I'm missing Jonathan's point.

Now, what I've read was that V$FILESTAT (with X$KCFIO beneath it) is unreliable in the face of async I/O. I've seen vague mentions on the Internet via Google, and three bugs on Metalink (searching for "V$FILESTAT") that don't say anything conclusive.

I'm faced with this situation on a 9.2 database of ours, and I've resorted to running some sample queries with tracing turned on, and doing some Perl magic on the trace files to do the math. (Hoping I can get a big enough collection of queries, and suitable load on the system... etc. etc. It's not ideal but at least it's understandable.)

Tom Kyte
August 13, 2008 - 4:02 am UTC

The mreadtim is artificially low - due to read-ahead - because when you issue large contiguous multiblock reads, they are served from the SAN cache, not from disk.

Then you start doing random, single-block IOs. The read-ahead is still happening, but since those reads are here, there and everywhere, you are truly doing physical IOs (and lots of physical IOs you don't care about and won't use), so the sreadtim gets inflated.

It is due to the IO pattern; it is not just a matter of skipping some blocks for the sreadtims. Rather, the sreadtims are doing true physical IO, while the mreadtims - due to the read-ahead - are reading from the cache of the SAN.
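
One way to eyeball the same effect outside of DBMS_STATS is to compare the average single-block and scattered read waits (a sketch; it assumes the time_waited_micro column of v$system_event, present in 9.2 and later):

select event,
       round( time_waited_micro / total_waits / 1000, 3 ) avg_ms
  from v$system_event
 where event in ( 'db file sequential read',   -- behind sreadtim
                  'db file scattered read' )   -- behind mreadtim
   and total_waits > 0;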

Illustration ... hopefully correct

Michael, August 13, 2008 - 7:22 am UTC

Let's illustrate it:

If this is our (SAN-) Disk (one "-" stands for 512 KB):

|--------------------------------------------------|

DB_BLOCK_SIZE = 8 KB, DB_FILE_MULTIBLOCK_READ_COUNT = 128 (= maximum of the instance's OS platform).
So, in a multiblock read we read 128 * 8 KB = 1 MB. (= "--")
Now, suppose the SAN (or other disk subsystem) reads on every access 4 MB ahead.

1. Oracle does a full table scan on a table (scattered read = multiblock io).

2. What is read on the first multiblock read request:

|------XXRRRRRR------------------------------------|
       TTTTTTTTTTTTTT

X = Read and transfer to Oracle in this first multiblock io.
R = Read ahead by the SAN.
T = Blocks of the table on the disk.

3. Oracle issues further multiblock io requests to the SAN. It's quite likely that the requested blocks are already in the read ahead cache (the R's) of the SAN. In our example this is the case.

4. In this example the first 4 multiblock read requests would require only one physical disk access because of the read ahead. A second physical disk access will be required to get the rest of the table blocks into the read ahead cache of the SAN.

5. So, for Oracle there were 7 multiblock io requests, but in reality there were only 2 physical disk accesses that consumed time to a significant extent. So, drastically simplified, for Oracle a multiblock io request costs 2/7 ms or sec ... whatever.

6. Now consider that for random io the read ahead cache is (nearly) useless -> one random (single) io request translates into one physical disk access. So, again drastically simplified, for Oracle a random (single) io request costs 1/1 ms/sec.

Tom, is my understanding (and illustration) of this issue correct?



Tom Kyte
August 18, 2008 - 9:16 am UTC

#3 and if Oracle issues a single block RANDOM IO, the odds are it is NOT in the cache.

we could stop there....


yes, you got it.

Ahh

A reader, August 14, 2008 - 4:43 pm UTC

So he was talking about readaheads that are *larger* than mbrc * block_size! I kept thinking only about readahead within one multiblock read...

Based on some things I've read, AIX breaks up multiblock reads into individual block reads at the device/SAN level, depending on your filesystem configuration. In those cases, multiblock read requests never reach the SAN as large reads, but you're saved by SAN readahead.

I had that stuck in my head when reading Jonathan's writeup.

Got it, got it... thanks for the brain nudge. :)

Not only for SANs ...

Michael, August 18, 2008 - 10:19 am UTC

Some further thoughts about this issue:

- You don't have to use a SAN to see this issue. Even on my laptop's 2.5" hard drive, I was able to observe it (mreadtim << sreadtim), because the drive has an 8 MB cache that is used for some kind of read ahead.
OK, I was the only user on this instance, but in principle this "problem" is not restricted to SANs.
And don't forget for tests like this:
alter system set filesystemio_options = directIO;

- If (and only if!) the sequence and data distribution is like in the above example, the artificially low value for mreadtim is reasonable - because the read ahead behavior gives us, in effect, a super big DB_FILE_MULTIBLOCK_READ_COUNT. (In the above example this was 4 MB / 8 KB = 512.)

- However, if the data distribution is more like this:

t  |------TTTT----------TTTT------TTTT--------TT------|
1         11RRRRRR                                    
2           22                                        
3                       33RRRRRR                       
4                         44                           
5                                 55RRRRRR             
6                                   66                
7                                             77RRRRRR  

11, 22 etc. = Read and transfer to Oracle in multiblock io.
R = Read ahead by the SAN.
T = Blocks of the table on the disk.

Now, dbms_stats.gather_system_stats would calculate an mreadtim of - simplified - 4/7 ms/sec.

Is this result correct or practicable?
Well, I think there is no simple answer: it depends - on the degree to which you can benefit from the read ahead mechanism of the storage device. (And this depends on how your data is organized on disc, how the storage device is used by other sessions etc. - many things to consider ...)

SREADTIM > MREADTIM issue

Yoav, May 08, 2009 - 9:16 am UTC

Hi Tom,

Version 10.2.0.4.

I collected system statistics for a few days and found
that almost always SREADTIM > MREADTIM, as you can see below:

DATE              SREADTIM  MREADTIM  CPUSPEED  MBRC
                      [ms]      [ms]            [blocks]
==========================================================
03/01/2009 00:33 10.98 5.16 1113 6
03/01/2009 01:33 11.34 4.27 1064 6
03/01/2009 02:33 12.56 5.73 1037 8
03/01/2009 03:33 8.42 3.55 999 8
03/01/2009 04:33 8.69 2.69 1087 8
03/01/2009 05:33 8.11 4.64 1079 7
03/01/2009 06:33 8.25 4.65 1053 6
03/01/2009 07:33 4.55 3.95 1115 7
03/01/2009 08:33 12.99 1.41 1075 8
03/01/2009 09:33 13.73 2.50 964 7
03/01/2009 10:33 3.31 1.73 984 8
03/01/2009 11:33 1.95 3.65 1044 8
03/01/2009 12:33 2.38 2.85 1102 9
03/01/2009 13:33 9.22 2.70 1111 8
03/01/2009 14:34 10.20 2.18 1116 9
03/01/2009 15:34 11.37 2.04 1121 9
03/01/2009 17:34 9.14 1.99 1119 12
03/01/2009 16:34 10.98 5.34 1111 9
03/01/2009 18:34 12.09 2.64 1120 11
03/01/2009 19:34 6.97 1.47 1121 10
03/01/2009 20:34 4.69 1.22 1118 8
03/01/2009 21:34 6.98 2.44 1047 11
03/01/2009 22:34 4.32 3.12 1120 6
03/01/2009 23:34 3.59 5.40 1109 5
...

This is SAN storage (HP XP12000) and, as Mr. Jonathan Lewis wrote and the vendor confirmed, the problem is related to the read-ahead.

I decided not to implement the system statistics and, as suggested, asked the vendor to supply "...the manufacturer's spec sheets to calculate the notional sreadtim and mreadtim - based on the MBRC, I/O seek time, and data transfer rates..."

A DBA from the vendor suggested:
"
Due to the read ahead used by the XP, sreadtim is greater than mreadtim, so the CBO will not accept the workload system statistics and the optimizer stops using the CPU costing algorithm. I think this is why the customer complains that sreadtim > mreadtim and decided NOT to implement the system statistics.

Now I would suggest two tests:
- the customer might need to adjust the mreadtim and sreadtim manually using the get_system_stats/set_system_stats procedures from the dbms_stats package and check how the performance of the sql queries is affected

- the customer can do several tests with db_file_multiblock_read_count = 8, 16, 32, 64, 128 and check if the sreadtim and mreadtim statistics change when using different "db_file_multiblock_read_count" settings.
"

After reading Mr. Lewis' blog: http://jonathanlewis.wordpress.com/2007/04/30/system-statistics/
I am not 100% sure that those are the right test cases I should take.
Mr. Lewis suggests a different approach to collect those values and also said
"...The intention was to do a completely 'off-db' test, using an external program"

I would appreciate your/Mr. Lewis' opinion regarding the test cases suggested by the vendor
Thank you

CBO response in 10gR2

Andrew, January 11, 2011 - 9:36 pm UTC

Hi. The statement that the CBO will ignore system stats if mreadtim < sreadtim applied to 9.2 in the original question. I can't find a clear answer as to whether this is true for 10.2. A 10053 trace on a 10.2.0.4 db with mreadtim=0.1 and sreadtim=0.6 (real stats!) seems to suggest that the CBO may still use the values. The Oracle doc is silent on the CBO's interpretation of such values (or I haven't found the relevant/useful reference so far). Can you clarify whether I need to 'force' mreadtim > sreadtim in order for system stats to be used in 10.2.0.4, please? Or point me to the relevant Note/doc? Thanks, Andrew
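
For reference, the 10053 trace Andrew mentions can be captured like this (a sketch; the traced statement is a stand-in):

alter session set events '10053 trace name context forever, level 1';
explain plan for select * from t where object_id between 1 and 250;
alter session set events '10053 trace name context off';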
Tom Kyte
January 12, 2011 - 10:49 am UTC

having sreadtim > mreadtim - but with different values - affects the plan generated by the optimizer:

ops$tkyte%ORA10GR2> /*
ops$tkyte%ORA10GR2> drop table t;
ops$tkyte%ORA10GR2> create table t as select * from all_objects order by dbms_random.random;
ops$tkyte%ORA10GR2> exec dbms_stats.gather_table_stats( user, 'T' );
ops$tkyte%ORA10GR2> create index t_idx on t(object_id);
ops$tkyte%ORA10GR2> */
ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> exec dbms_stats.set_system_stats( 'MREADTIM', 0.1 );

PL/SQL procedure successfully completed.

ops$tkyte%ORA10GR2> exec dbms_stats.set_system_stats( 'SREADTIM', 0.6 );

PL/SQL procedure successfully completed.

ops$tkyte%ORA10GR2> select pname, pval1 from aux_stats$ where pname in ('MREADTIM','SREADTIM');

PNAME                               PVAL1
------------------------------ ----------
SREADTIM                               .6
MREADTIM                               .1

ops$tkyte%ORA10GR2> set autotrace traceonly explain
ops$tkyte%ORA10GR2> select * from t t1 where object_id between 1 and 250;

Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   192 | 17856 |   179  (12)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T    |   192 | 17856 |   179  (12)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("OBJECT_ID"<=250 AND "OBJECT_ID">=1)

ops$tkyte%ORA10GR2> set autotrace off
ops$tkyte%ORA10GR2> exec dbms_stats.set_system_stats( 'MREADTIM', 0.1 );

PL/SQL procedure successfully completed.

ops$tkyte%ORA10GR2> exec dbms_stats.set_system_stats( 'SREADTIM', 0.2 );

PL/SQL procedure successfully completed.

ops$tkyte%ORA10GR2> select pname, pval1 from aux_stats$ where pname in ('MREADTIM','SREADTIM');

PNAME                               PVAL1
------------------------------ ----------
SREADTIM                               .2
MREADTIM                               .1

ops$tkyte%ORA10GR2> set autotrace traceonly explain
ops$tkyte%ORA10GR2> select * from t t2 where object_id between 1 and 250;

Execution Plan
----------------------------------------------------------
Plan hash value: 470836197

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |   192 | 17856 |   199   (3)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T     |   192 | 17856 |   199   (3)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T_IDX |   192 |       |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("OBJECT_ID">=1 AND "OBJECT_ID"<=250)

ops$tkyte%ORA10GR2> set autotrace off


More on 10gR2 Behaviour

Andrew, January 12, 2011 - 10:29 pm UTC

Thanks, Tom.

Adding to the above, I did some empirical tests.

From these, I could see that:
(i) when mreadtim < sreadtim, the mreadtim value is ignored and a default rule is adopted;
(ii) the mbrc value is still used in this case, i.e. cost increases with smaller mbrc and decreases with larger mbrc, given an access path using multiblock reads (e.g. FTS, FFS).

My tests put the mreadtim 0.01 ms faster or slower than sreadtim to minimise the effect of the mreadtim value itself in the costing with respect to the mbrc component.

Analysing the 10053 trace data, I'm pretty sure what happens is that, once mreadtim<sreadtim, the CBO assumes that a hypothetical 8-block mread takes the same time as a 1-block sread. From this premise, it extrapolates multiblock read times using mbrc value, and uses this number as the CBO cost of the mread.

e.g. (with guesses about rounding functions - these numbers are not exactly the same as the CBO's):

if mbrc=100, then mreadtim=ceil(100/8)*sreadtim. ie. an mread has a CBO cost of 13 per mread,
so a mread of 1000 blocks would cost (1000/100)*13=130.

if mbrc=10, then mreadtim=ceil(10/8)*sreadtim, ie. an mread has a CBO cost of 2 per mread
so a mread of 1000 blocks would cost (1000/10)*2=200

So, even if mreadtim<sreadtim, sreadtim is still used (per your example) and mbrc is still used (per the result above).

I can see that using these values is more sensible than using the gathered mreadtim, as we can reasonably assume that random mreads from a SAN will not normally deliver the cache-affected values that the gather obtained (depending on the workload in progress during gathering, of course).

The query as to whether it is desirable to 'tweak' the database's mreadtim value to 'force' use of mbrc and sreadtim is answered here, I think. Both your and my examples suggest that this is not in fact necessary, or even desirable. My thoughts are now to leave well enough alone, unless it is possible to gather a more accurate value for mreadtim.

Just for interest, I used Jonathan Lewis' reader/writer programs to try to obtain 'real' values for this server, and got values like 3 ms for an sread and 12 ms for a 48-block mread (i.e. mbrc as gathered). Based on its own gathered stats, the CBO would have used 0.6 ms and (48/8)*0.6 ms = 3.6 ms respectively, giving optimistic CBO I/O costs with respect to CPU costs, and penalising multiblock access paths a bit more than it should with respect to single-block paths. Does this matter? I guess I'd have to bite the bullet and force the stats to find out...

For ref: db_file_multiblock_read_count=128 (default), DB: 10.2.0.4 EE, OS: AIX 5.3, Filesystem: jfs2.

Hope this clarifies the behaviour for others.

Andrew
Tom Kyte
January 13, 2011 - 7:24 am UTC

(i) The information isn't ignored, the fact that changing the sreadtim, while keeping it greater than the mreadtim, changed the plan showed that. It might be used differently - but it is definitely being used.


(ii) you shouldn't be setting db file multiblock read count at all in 10g. In 10g, we'll use the actual OBSERVED multi-block read count to cost the query - and always use the MAXIMUM multi-block read count for your operating system at query execution time. The best of all possible worlds - an accurate costing of a multi-block read using your actual observed historical read sizes - yet with the ability to do a maximum IO size whenever possible.




thanks for the detailed followup!
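
For reference, "not setting" the parameter means removing it from the spfile so 10g can manage the read size itself; a minimal sketch, assuming an spfile is in use:

alter system reset db_file_multiblock_read_count scope=spfile sid='*';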

set mbrc=MAX OS mbrc?

Loïc, July 05, 2011 - 5:57 am UTC

"(ii) In 10g, we'll use the actual OBSERVED multi-block read count to cost the query - and always use the MAXIMUM multi-block read count for your operating system at query execution time."

Hi Tom,
do you mean that if I know what is *my* maximum mbrc for my OS, I can set it for the mbrc system statistic?

It would be 128 for an 8 KB block size.

Thanks for the answer,
Loïc
Tom Kyte
July 05, 2011 - 12:50 pm UTC

do NOT set it.

we will always try to use the MAX multi-block read count when issuing the IO (if you do not set it).

however, when we COST the query, we'll use the IO size we've historically observed your system performing.


Meaning - you get a realistic COST and you get the best IO size possible at runtime.

Unless you set it, in which case - all of this good stuff goes away.

Lost!

Loïc, July 05, 2011 - 4:20 pm UTC

Tom, in my previous post, by setting it, I meant:

exec DBMS_STATS.SET_SYSTEM_STATS('mbrc',128);

Is your comment still true?
Tom Kyte
July 05, 2011 - 4:33 pm UTC

why would you want to do that? that would artificially deflate the cost of a full scan?

what is the reason for doing this - it will affect the costing of a full scan, but not the performance of a full scan.

Reality...

Loïc, July 06, 2011 - 3:12 am UTC

My opinion is as follows:
If Oracle will read the MAXIMUM multi-block read count (which is 1MB per I/O on my server) when performing a full table scan, then why not tell the CBO so?
Isn't that more realistic?

I think that leaving the system statistic MBRC smaller than what will *really* be done "would artificially inflate the cost of a full scan", no?



Tom Kyte
July 06, 2011 - 8:09 am UTC

sigh, i'll try again.


We monitor your system and determine what your true, observed, historically achieved multiblock read count is. we use that to COST the query.

However, at run time - we always try to issue the largest read we can.


You see, if you set your multiblock read count to 1mb - but on average we can only read 512k at a time (because some of your blocks are already IN THE CACHE), then you will have artificially lowered the cost of the full scan (the optimizer will cost it assuming you'll really be able to read 1mb when you can really only get 512k because some blocks are already in your cache) and that is all. You'll get full scans when you should have gotten some other access method.


Just because your read count is set to 1mb DOES NOT MEAN you will do 1mb IO's. Assume you are going to read a table using a full scan - and the table blocks start from 1. Also, assume the multiblock read count is 128 (128x8k=1mb).

Now, in your cache, block 42 already exists, it is there - someone read it via an index or something.

When you go to issue your first IO of blocks 1-128, we have to first check the buffer cache to make sure none of the blocks are in there (if they are - they are the blocks we need to process - we cannot take old blocks from disk and load them up). At this point we find that block 42 is there so we issue an IO request for

1..41 = 328k
43..128 = 688k

So you see, your historically observed multiblock read count is 63 blocks - not 128. You made the full scan seem cheaper than it ever could be.


We tell the optimizer what you have observed to be able to accomplish. Not the maximum you could in theory achieve if the stars were to align with jupiter and mars.

So, again, we COST the query using your observed, achieved multiblock read count - but we always attempt to issue the IO using the maximum possible size just in case we get lucky.
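
(As a quick arithmetic check on the 63 - the average of the 41-block and 86-block pieces above:)

select trunc( (41 + 86) / 2 ) as avg_blocks_per_read from dual;  -- 63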

Got it now!

Loïc, July 06, 2011 - 10:02 am UTC

Sorry for disturbing you again about this subject. Now, I understand the cache plays a big role here.

My approach is definitely wrong, even more if my database has a large buffer cache!

Thanks Tom for this crystal clear explanation :)

Sigrid Keydana, November 09, 2012 - 3:55 am UTC

Tom,

many thanks for your clarifications - I was actually referred to this post by Oracle support :-).

I have a number of questions still.
The context is, I have gathered system statistics on a production OLTP RAC 4-node system for ~ 2 weeks (using a custom table). This results in 308 1-hour measurements overall, and:
sreadtim (average/stddev/count): 1.79/3.36/265
mreadtim (average/stddev/count): 0.81/1.93/163
mbrc (average/stddev/count): 2/1.46/17


1) count(mbrc) < count(mreadtim)

I would expect the same number of valid, nonnull measurements for both mbrc and mreadtim - there should be a "common threshold" in both cases, so I wonder what's going on here.

2) What action to take based on these results

First option - set the averages in sys.aux_stats$?
With the extremely low mbrc of 2, this does not seem too dangerous to me - if it increases the percentage of index access paths, it will further decrease interconnect traffic (which sounds better than "the other way round") :-)

Second option - gather system stats directly during a representative period. Most probably, mbrc will end up null - and according to the 11.2 documentation ( http://docs.oracle.com/cd/E11882_01/server.112/e16638/stats.htm#autoId24 ), db_file_multiblock_read_count will be used if it is set to > 0. As recommended, I have unset this on the systems, and then a db_file_multiblock_read_count of 8 will be used.
As per the 11.2 Reference, even though a default of 128 will now be set by Oracle, it will NOT be used for estimation ( http://docs.oracle.com/cd/E11882_01/server.112/e25513/initparams056.htm#CHDFAFHE )

Third option – use noworkload statistics (gathered ones, not the default).
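
(For that third option, the gather itself is a one-liner - a minimal sketch; in noworkload mode it populates IOSEEKTIM/IOTFRSPEED in sys.aux_stats$ rather than sreadtim/mreadtim:)

exec dbms_stats.gather_system_stats( gathering_mode => 'NOWORKLOAD' )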

3) I understand that all the above relates to cost estimation, not runtime I/O. Regarding runtime I/O, the OS's maximum I/O size or - as per the 11.2 documentation - "the maximum I/O size that can be performed efficiently" will be used. How is this determined by Oracle?
My fellow Linux admins tell me that there is no such thing as a maximum I/O size enforced by the OS (verified using, e.g., strace).

4) I am trying to calculate sreadtim, mreadtim and mbrc myself, from dba_hist_filestatxs. If my approach (and query) are correct, I get the following averaged results over the same period:

Calculated “sreadtim”: 3.15 (vs. 1.79 from gathering system stats)
Calculated “mreadtim”: 1.6 (vs. 1.02 from gathering system stats)
Calculated mbrc: 7.9 (vs. 2)

These are quite big differences (especially for mbrc), so I wonder if there are problems with either approach and/or query. Would you agree to have a look at the query and tell me your opinion?

Many thanks in advance!
Sigrid

Tom Kyte
November 09, 2012 - 6:57 am UTC

can you give me the step by step procedures you followed to get your 308 observations? If you don't do a lot of full scans, we are free to skip generating statistics associated with that.


2) since you don't full scan and don't expect to (it is OLTP), the setting isn't that relevant.

You might just set these based on the spec's for your storage system (the s/m readtim's).


3) it is port specific (just assume 1MB in general). You could test it by filling up a table, flushing the buffer cache and full scanning it with sql tracing with wait events enabled - you can see the block count in the trace file (see the sketch below).

4) don't have mreadtim less than sreadtim; if you do, it isn't used - we'll compute one from the sreadtim, block size and mbrc.


See also:
https://forums.oracle.com/forums/thread.jspa?threadID=2179217#9386305
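
The test from point 3 above might look roughly like this (a sketch; the table name BIG_T is a stand-in):

alter system flush buffer_cache;
alter session set events '10046 trace name context forever, level 8';
select /*+ full(big_t) */ count(*) from big_t;
alter session set events '10046 trace name context off';
-- in the trace file, the p3 value of each 'db file scattered read'
-- wait is the number of blocks read by that multiblock read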

A reader, November 09, 2012 - 10:57 am UTC

Referring to your amazing explanation of multiblock IO, two posts above:


When you go to issue your first IO of blocks 1-128, we have to first check the buffer cache to make sure none of the blocks are in there (if they are - they are the blocks we need to process - we cannot take old blocks from disk and load them up). At this point we find that block 42 is there so we issue an IO request for 

1..41 = 328k 
43..128 = 688k 

So you see, your historically observed multiblock read count is 63 blocks - not 128. You made the full scan seem cheaper than it ever could be.




Will Oracle consider the min or the max IO for future full scans, i.e. 328/8 = 41 blocks or 688/8 = 86 blocks?

I hope you mean either 41 or 86 blocks, right, instead of 63 blocks?

Thank you,
Tom Kyte
November 09, 2012 - 3:29 pm UTC

Oracle will always try to use up to the max IO size to do the IO

We will use the observed average multiblock read count to cost the query.

Here - we averaged 63 blocks per multiblock read, we'll use that to cost the query (to assign a cost to the full scan).

But if we have the opportunity to read 128 blocks in real life when running the query - we will.

Sigrid Keydana, November 12, 2012 - 11:00 am UTC

Hi Tom,

many thanks for your answer!

1) I've simply set up a scheduler job that starts every hour, calling DBMS_STATS.GATHER_SYSTEM_STATS ('interval',59, 'systemstats', 'W_' || to_char(sysdate, 'yyyy_mm_dd_hh24_mi'), 'schemaname');

Up till now, I have 465 intervals and the following results:

               avg   stddev   count
sreadtim      1.59     2.87     376
mreadtim      0.63     1.66     241
mbrc          2.22     1.78      23
all intervals                   456

So, I have 241 of 465 intervals where Oracle took a measurement of mreadtim, but only 23 where mbrc was measured.

2) I wonder what you'd recommend: using
- the average measurement results listed above, possibly calculating not the average over all intervals, but for the "most relevant" time periods only
- using noworkload stats
- storage vendor data: here my colleagues tell me that this would depend on the individual storage configuration (what disks you use, how many spindles, etc.)

3) thanks, I will definitely do the trace. By port-specific you mean dependent on the OS Oracle has been ported to, right? Because what made me wonder is this citation from the ASM administration guide:
"Oracle recommends that the allocation unit (AU) size for a disk group be set to 4 megabytes (MB). In addition to this AU size recommendation, the operating system (OS) I/O size should be set to the largest possible size."
This sounds like you could somehow influence the OS I/O size...

4) when you say mbrc here, you mean db_file_multiblock_read_count, right - or whatever Oracle does when I have not set it?
And when you say "don't have mreadtim less than sreadtim; if you do, it isn't used - we'll compute one from the sreadtim, block size and mbrc", I think we arrive at a fourth alternative for point 2) above, don't we: we just set sreadtim in aux_stats$, and Oracle will figure out mreadtim and mbrc? Would you recommend doing that?

Many thanks again
Sigrid




Tom Kyte
November 14, 2012 - 10:11 am UTC

truth be told, I don't usually mess with these at all - just let them all default. They subtly affect the costing of queries by affecting the costs of single/multi-block reads.

I would tend to let them "be" - you'll be in the company of most of the world, you'll be using what we use when we regression test the database.

I would suspect (not confirmed) that many of your observations had a single (or a few) multiblock read event and we used the time of that, but didn't count the multi-block read count - just not enough information. I don't have time right now (on the road) to test that hypothesis - it could be wrong. I don't think I've observed that in the past.

by port specific - the max IO size is limited by the OS and could vary by OS; it would be OS by OS that you'd have to look at (sort of outside of my area..)
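
Reverting to the defaults Tom describes is just a delete (a minimal sketch; afterwards sys.aux_stats$ carries only the noworkload defaults such as IOSEEKTIM and IOTFRSPEED):

exec dbms_stats.delete_system_stats
select pname, pval1 from sys.aux_stats$ where sname = 'SYSSTATS_MAIN';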

Sigrid Keydana, November 12, 2012 - 11:13 am UTC

Tom,

I'm also trying to verify the above measurements against dba_hist_filestatxs (over the same time period).
I'd like to post the query I'm using for that:

with
/*
PHYRDS Number of reads
PHYBLKRD Number of blocks read
SINGLEBLKRDS Number of single block reads = number of single blocks read
sreadtim = SINGLEBLKRDTIM / SINGLEBLKRDS
mreadtim = (READTIM - SINGLEBLKRDTIM) / (PHYRDS - SINGLEBLKRDS)
mbrc = (PHYBLKRD - SINGLEBLKRDS) / (PHYRDS - SINGLEBLKRDS)
*/
rawdata as
(select snap_id, instance_number, begin_interval_time, filename,
phyblkrd, singleblkrds, readtim, singleblkrdtim, phyrds,
phyblkrd - lag(phyblkrd) over(partition by instance_number, filename order by snap_id ) block_read_delta,
singleblkrds - lag(singleblkrds) over(partition by instance_number, filename order by snap_id ) sb_read_delta,
readtim - lag(readtim) over(partition by instance_number, filename order by snap_id ) read_time_cs_delta,
singleblkrdtim - lag(singleblkrdtim) over(partition by instance_number, filename order by snap_id) sb_read_time_cs_delta,
phyrds - lag(phyrds) over(partition by instance_number, filename order by snap_id) phyrd_delta
from dba_hist_filestatxs join dba_hist_snapshot using (snap_id, instance_number)
where begin_interval_time between
to_timestamp('10-24-2012 17:00', 'mm-dd-yyyy hh24:mi') and to_timestamp('11-12-2012 17:00', 'mm-dd-yyyy hh24:mi')
order by snap_id, instance_number, filename),
rawdeltas as
(select snap_id, instance_number, begin_interval_time, filename,
(block_read_delta - sb_read_delta) non_sb_read_delta,
sb_read_delta,
(read_time_cs_delta - sb_read_time_cs_delta) non_sb_read_time_cs_delta,
sb_read_time_cs_delta,
(phyrd_delta - sb_read_delta) non_sb_rd_delta
from rawdata where (block_read_delta is not null and sb_read_delta is not null and read_time_cs_delta is not null and sb_read_time_cs_delta is not null and phyrd_delta is not null)
and ((block_read_delta - sb_read_delta) > 0 and sb_read_delta > 0
and (read_time_cs_delta - sb_read_time_cs_delta) > 0
and sb_read_time_cs_delta > 0 and phyrd_delta > 0 and (phyrd_delta - sb_read_delta) > 0)),
overall as
(select sum(non_sb_read_delta) non_sb_reads_overall,
sum(sb_read_delta) sb_reads_overall,
sum(non_sb_read_time_cs_delta) non_sb_read_time_overall,
sum(sb_read_time_cs_delta) sb_read_time_overall,
round(10 * avg(sb_read_time_cs_delta/sb_read_delta), 2) sbreadtim,
round(10 * avg(non_sb_read_time_cs_delta/non_sb_read_delta), 2) non_sbreadtim,
round(avg(non_sb_read_delta/non_sb_rd_delta), 2) non_sb_rc_overall

from rawdeltas)
select * from overall;

These are the results I'm getting:

sbreadtim 3.12
non-sbreadtim 1.59
non-sb-rc-overall ("mbrc") 7.73

With my query above, I see at least one reason for getting different results - the AWR snaps are taken half-hourly, not hourly, so I'm averaging over different time periods.
But still, the results are quite different, especially regarding mbrc.
Might I ask your opinion on what to conclude?

Thanks again
Sigrid
Tom Kyte
November 14, 2012 - 10:14 am UTC

i'd let things default...

I cannot right now reverse engineer your query to validate it - sorry...
