Question and Answer

Tom Kyte

Thanks for the question, Raj.

Asked: December 12, 2004 - 8:08 pm UTC

Last updated: December 16, 2004 - 8:40 am UTC

Version: 9.2.0.4

Viewed 1000+ times

You Asked

Tom,
I see a large difference between the number of rows and the sample size when I issue COMPUTE STATISTICS. Is anything wrong?

NUM_ROWS SAMPLE_SIZE
---------- -----------
177403134 1121790


and Tom said...

They are showing two entirely different things.

num_rows = estimated number of rows in the table.

sample_size = physical number of rows inspected to arrive at that estimate.

Seems OK to me.


Comments

A reader, December 13, 2004 - 12:39 am UTC

Thanks Tom, it helps.
Raj

reader

alex, December 13, 2004 - 6:15 am UTC

Could you please elaborate a little on the sample_size again? I am unable to understand it.

Tom Kyte
December 13, 2004 - 10:17 am UTC

In this example, big_table has precisely 1,000,000 rows....


big_table@ORA9IR2> exec dbms_stats.gather_table_stats( user, 'BIG_TABLE', ESTIMATE_PERCENT=>1 );

PL/SQL procedure successfully completed.

big_table@ORA9IR2> select sample_size, num_rows from user_tables where table_name = 'BIG_TABLE';

SAMPLE_SIZE NUM_ROWS
----------- ----------
9804 980400

1% of 1,000,000 is 10,000 -- sample_size reflects the number of rows actually inspected during the gather; 980,400 is the estimated TOTAL number of rows in the table, based on that 1% sample.

big_table@ORA9IR2> exec dbms_stats.gather_table_stats( user, 'BIG_TABLE', ESTIMATE_PERCENT=>5 );

PL/SQL procedure successfully completed.

big_table@ORA9IR2> select sample_size, num_rows from user_tables where table_name = 'BIG_TABLE';

SAMPLE_SIZE NUM_ROWS
----------- ----------
50022 1000440

5% of 1,000,000 is 50,000 -- sample_size reflects that. Based on a 5% random sample, we believe there are 1,000,440 rows in the table.


big_table@ORA9IR2> exec dbms_stats.gather_table_stats( user, 'BIG_TABLE', ESTIMATE_PERCENT=>10 );

PL/SQL procedure successfully completed.

big_table@ORA9IR2> select sample_size, num_rows from user_tables where table_name = 'BIG_TABLE';

SAMPLE_SIZE NUM_ROWS
----------- ----------
100516 1005160

10% of 1,000,000 is 100,000 -- we sampled 100,516 rows. The estimated number of rows based on a 10% sample is 1,005,160.

big_table@ORA9IR2> exec dbms_stats.gather_table_stats( user, 'BIG_TABLE', ESTIMATE_PERCENT=>20 );

PL/SQL procedure successfully completed.

big_table@ORA9IR2> select sample_size, num_rows from user_tables where table_name = 'BIG_TABLE';

SAMPLE_SIZE NUM_ROWS
----------- ----------
200083 1000415

big_table@ORA9IR2> exec dbms_stats.gather_table_stats( user, 'BIG_TABLE', ESTIMATE_PERCENT=>40 );

PL/SQL procedure successfully completed.

big_table@ORA9IR2> select sample_size, num_rows from user_tables where table_name = 'BIG_TABLE';

SAMPLE_SIZE NUM_ROWS
----------- ----------
400139 1000348

big_table@ORA9IR2> exec dbms_stats.gather_table_stats( user, 'BIG_TABLE' );

PL/SQL procedure successfully completed.

big_table@ORA9IR2> select sample_size, num_rows from user_tables where table_name = 'BIG_TABLE';

SAMPLE_SIZE NUM_ROWS
----------- ----------
1000000 1000000

and so on...
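The arithmetic connecting SAMPLE_SIZE to NUM_ROWS in the runs above can be sketched outside the database. This is a toy illustration of the scale-up only, not how dbms_stats is actually implemented:

```python
def estimate_num_rows(sample_size: int, estimate_percent: float) -> int:
    """Scale a sampled row count up to an estimated total:
    num_rows ~= sample_size / (estimate_percent / 100)."""
    return round(sample_size * 100.0 / estimate_percent)

# The NUM_ROWS figures from the gathers above reproduce exactly:
print(estimate_num_rows(9804, 1))     # 980400
print(estimate_num_rows(50022, 5))    # 1000440
print(estimate_num_rows(100516, 10))  # 1005160
print(estimate_num_rows(200083, 20))  # 1000415
print(estimate_num_rows(400139, 40))  # 1000348
```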


Isolated index statistics

Sergey, December 13, 2004 - 8:12 am UTC

Hello Tom,

I need some clarification regarding a situation I met at one of our customers' DBs. They have a lot of indexes with computed statistics, but the respective tables have none. I wonder how they managed to achieve that? The only way I could figure out is a rather exotic one: first they gathered stats on the tables and then deleted them with DBMS_STATS.DELETE_TABLE_STATS ...cascade_indexes => FALSE.
AFAIK, stats on indexes without stats on the tables are pretty useless; at least with optimizer mode 'CHOOSE' they are ignored. Or am I wrong, and can index stats be used by Oracle even if there are no respective table stats?

BW

Sergey


Tom Kyte
December 13, 2004 - 10:20 am UTC

How about gather index stats, or analyze index?

having index stats without table stats would not be very useful.

So the statistics were estimated, not computed

Alexander, December 13, 2004 - 9:37 am UTC

So I guess that although the original poster wrote "when I issue compute statistics", in fact ESTIMATE STATISTICS was issued, correct?

Tom Kyte
December 13, 2004 - 10:21 am UTC

ahh, i see that point.

I have to assume that they issued an analyze or dbms_stats command and really did an estimate (eg: they did not compute)

Why the sample size is not 10,000

Satish, December 13, 2004 - 11:03 am UTC

"1% of 1,000,000 is 10,000 -- sample_size reflects the number of rows inspected during the gather, 980,400 represents the estimated TOTAL number of rows in the
table based on a 1% sample"
In the above example of yours, why doesn't Oracle pick 10,000 rows to sample? Why is it 9,804 and not, say, 9,805? I thought that Oracle would know how many rows are present in the table and would sample 10,000 rows. The reason for my assumption is: if it doesn't know how many rows are present, then it doesn't know how much 1% is either. Obviously, I am missing something. Can you please throw some light on this?

Tom Kyte
December 13, 2004 - 1:44 pm UTC



you have a classic chicken and egg problem here if you think about it. you say

"I thought that Oracle would know how many
rows are present in the table and would sample 10,000 rows."

well, if Oracle knew how many rows precisely were in the table, why are we gathering statistics!! In order for Oracle to know "one million", we must have computed!


Basically (conceptually), Oracle flies through the table and generates a random number between 1 and 100 for each row. If the random number <= the percent to estimate, the row is included in the sample -- else, not.

But everything is an estimate here (if you want PRECISION, compute -- "estimate", by its very name, tells us something!)

It full scanned the table, keeping 1% of the rows it looked at by generating a random-like number between 1 and 100 and asking "are you less than or equal to 1?"

On a large enough set, that generates a 1% random sample.
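That conceptual per-row coin flip can be sketched as a toy simulation (illustrative only -- Oracle's real mechanism is the SAMPLE clause discussed further down):

```python
import random

random.seed(42)  # fixed seed so the illustration is repeatable

TRUE_ROW_COUNT = 1_000_000
ESTIMATE_PERCENT = 1

# "Fly through" the table: keep a row when its random 0-100 draw
# falls at or under the estimate percent.
sample_size = sum(
    1 for _ in range(TRUE_ROW_COUNT)
    if random.uniform(0, 100) <= ESTIMATE_PERCENT
)

# The estimator only ever sees the kept rows, so it scales them up.
num_rows = sample_size * 100 // ESTIMATE_PERCENT

print(sample_size, num_rows)  # roughly 10,000 and 1,000,000 -- not exact
```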

Clarification

Raj, December 13, 2004 - 1:37 pm UTC

Tom,
Just to make it clear: I issued dbms_stats to compute full table and index stats only and didn't attempt an estimate. That's why I asked whether anything was wrong in the server process.

Tom Kyte
December 13, 2004 - 2:13 pm UTC

show us a cut and paste of every step -- the dbms_stats, plus the query to show this output.

Is number of BLOCKS accurate?

Gabe, December 13, 2004 - 2:08 pm UTC

To Satish ...

<quote>In the above example of yours why doesn't oracle pick 10,000 rows to sample</quote>

Sample Percent indicates the number of blocks to sample ... not number of rows. The number of rows is derived from the 10% sample of blocks and then extrapolated to estimate the total number of rows in the table.

So the question is ... does Oracle know the total number of blocks in the table? I would guess it doesn't since USER_TABLES.BLOCKS is only populated after an analyze.

Tom ... I don't quite follow this ...

<quote>Basically (conceptually), oracle fly's through the table and generates a random number between 1 and 100 for each row. If random number <= % to estimate, this row is included in the estimate -- else -- not.</quote>

because, it implies each row is _visited_ ... hence the total number would be known.

What I noticed is ...

1. Upon "create table T as select * from all_objects" ... BLOCKS, NUM_ROWS, SAMPLE_SIZE are empty ... OK

2. Successive dbms_stats.gather_table_stats for T with samples 20, 60, 100 ... the NUM_ROWS is only accurate after SAMPLE 100% ... but BLOCKS is accurate after the first collection with SAMPLE 20% ... and hence my question:

Can Oracle accurately calculate the number of BLOCKS with the first analyze call? Or it was just a _lucky_ estimate?



Tom Kyte
December 13, 2004 - 2:37 pm UTC

dbms_stats defaults BLOCK_SAMPLE to FALSE. we are row sampling. I did not block sample.

The sample percent is the number of rows to sample.......


We do however know the exact number of blocks in the table (full scans would not work otherwise!)






(conceptually)


is the keyword -- we use the SAMPLE clause on the query -- my 10% sample resulted in a query like this:


select /*+ cursor_sharing_exact dynamic_sampling(0) no_monitoring*/ count(*)
,sum(vsize("ID")),substrb(dump(min("ID"),16,0,32),1,120),
substrb(dump(max("ID"),16,0,32),1,120),count(distinct "OWNER"),
sum(vsize("OWNER")),substrb(dump(min("OWNER"),16,0,32),1,120),
substrb(dump(max("OWNER"),16,0,32),1,120),count(distinct "OBJECT_NAME"),
sum(vsize("OBJECT_NAME")),substrb(dump(min("OBJECT_NAME"),16,0,32),1,120),
substrb(dump(max("OBJECT_NAME"),16,0,32),1,120),count("SUBOBJECT_NAME"),
count(distinct "SUBOBJECT_NAME"),sum(vsize("SUBOBJECT_NAME")),
substrb(dump(min("SUBOBJECT_NAME"),16,0,32),1,120),
substrb(dump(max("SUBOBJECT_NAME"),16,0,32),1,120),count(distinct
"OBJECT_ID"),sum(vsize("OBJECT_ID")),substrb(dump(min("OBJECT_ID"),16,0,32),
1,120),substrb(dump(max("OBJECT_ID"),16,0,32),1,120),count("DATA_OBJECT_ID")
,count(distinct "DATA_OBJECT_ID"),sum(vsize("DATA_OBJECT_ID")),
substrb(dump(min("DATA_OBJECT_ID"),16,0,32),1,120),
substrb(dump(max("DATA_OBJECT_ID"),16,0,32),1,120),count("OBJECT_TYPE"),
count(distinct "OBJECT_TYPE"),sum(vsize("OBJECT_TYPE")),
substrb(dump(min("OBJECT_TYPE"),16,0,32),1,120),
substrb(dump(max("OBJECT_TYPE"),16,0,32),1,120),count(distinct "CREATED"),
substrb(dump(min("CREATED"),16,0,32),1,120),substrb(dump(max("CREATED"),16,
0,32),1,120),count(distinct "LAST_DDL_TIME"),
substrb(dump(min("LAST_DDL_TIME"),16,0,32),1,120),
substrb(dump(max("LAST_DDL_TIME"),16,0,32),1,120),count("TIMESTAMP"),
count(distinct "TIMESTAMP"),sum(vsize("TIMESTAMP")),
substrb(dump(min("TIMESTAMP"),16,0,32),1,120),substrb(dump(max("TIMESTAMP"),
16,0,32),1,120),count("STATUS"),count(distinct "STATUS"),sum(vsize("STATUS")
),substrb(dump(min("STATUS"),16,0,32),1,120),substrb(dump(max("STATUS"),16,
0,32),1,120),count("TEMPORARY"),count(distinct "TEMPORARY"),
sum(vsize("TEMPORARY")),substrb(dump(min("TEMPORARY"),16,0,32),1,120),
substrb(dump(max("TEMPORARY"),16,0,32),1,120),count("GENERATED"),
count(distinct "GENERATED"),sum(vsize("GENERATED")),
substrb(dump(min("GENERATED"),16,0,32),1,120),substrb(dump(max("GENERATED"),
16,0,32),1,120),count("SECONDARY"),count(distinct "SECONDARY"),
sum(vsize("SECONDARY")),substrb(dump(min("SECONDARY"),16,0,32),1,120),
substrb(dump(max("SECONDARY"),16,0,32),1,120)
from
"BIG_TABLE"."BIG_TABLE" sample ( 10.0000000000) t


when you set block_sample=>true, we use this:

...
from
"BIG_TABLE"."BIG_TABLE" sample block ( 10.0000000000) t



So, you see, the query itself is the thing doing the "get me a random number, should we see this row or not -- if so, output it". The guy getting the row output -- he is the one doing the estimating; he does not know "1,000,000".

When we do a block sample, we read the extent map and figure out what we are going to full scan. Then we do that random number thing block by block -- "do we read this block or not"


So, the SAMPLE clause is the thing that flies through the table -- dbms_stats, the thing doing the estimation, gets the resulting rows but doesn't know how many rows were there really.


1) because no one would use them....

2) num_rows could be "lucky" and be accurate with an estimate

the number of blocks is always a well known thing -- it just isn't populated in the dictionary views until it needs to be (available via dbms_space - what I use in my showspace script)...
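The row-versus-block distinction above can be sketched with a toy table of fixed-size blocks (a hypothetical model, not Oracle internals): row sampling draws per row but touches every block, while block sampling draws once per block and skips unchosen blocks entirely.

```python
import random

rng = random.Random(1)  # fixed seed for a repeatable illustration

ROWS_PER_BLOCK = 100
N_BLOCKS = 10_000          # a 1,000,000-row toy table
PCT = 10

# Row sampling: every block is read; each row gets its own draw.
row_sample = sum(
    1
    for _block in range(N_BLOCKS)
    for _row in range(ROWS_PER_BLOCK)
    if rng.uniform(0, 100) <= PCT
)

# Block sampling: one draw per block; a chosen block contributes all
# of its rows, and a skipped block is never read at all (less I/O).
block_sample = sum(
    ROWS_PER_BLOCK
    for _block in range(N_BLOCKS)
    if rng.uniform(0, 100) <= PCT
)

print(row_sample, block_sample)  # both near 100,000
```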

clarification

Raj, December 13, 2004 - 2:49 pm UTC

The big table is of size 26G

I issued the following once the indexes are created.

exec dbms_stats.gather_table_stats('owner','BIG_TABLE',NULL,NULL,FALSE,'FOR ALL COLUMNS SIZE 1',2,'ALL',TRUE,NULL,NULL,NULL,FALSE);

Once the analyze is done, I did the following select

SELECT index_name,num_rows, sample_size FROM DBA_INDEXES
WHERE table_name='BIG_TABLE'

and the result is,

idx1,186772425,1393186
idx2,174709765,1538445
idx3,166264094,2129274
idx4,2041023,2500
idx5,1554865,2500
idx6,177403134,1121790



Tom Kyte
December 13, 2004 - 2:58 pm UTC

I cannot reproduce. Check your LAST_ANALYZED values and make sure the table, indexes, et al. are all of the same time.


big_table@ORA9IR2> begin
2 dbms_stats.gather_table_stats
3 (ownname => user,
4 tabname => 'BIG_TABLE',
5 partname=>NULL,
6 estimate_percent=>NULL,
7 block_sample => FALSE,
8 method_opt => 'FOR ALL COLUMNS SIZE 1',
9 degree => 2,
10 granularity=>'ALL',
11 cascade =>TRUE,
12 stattab=>NULL,
13 statid=>NULL,
14 statown =>NULL,
15 no_invalidate=>FALSE);
16 end;
17 /

PL/SQL procedure successfully completed.

big_table@ORA9IR2> select index_name,num_rows, sample_size FROM DBA_INDEXES
2 WHERE table_name='BIG_TABLE';

INDEX_NAME NUM_ROWS SAMPLE_SIZE
------------------------------ ---------- -----------
BIG_TABLE_PK 1000000 1000000


Clarification

Raj, December 13, 2004 - 3:51 pm UTC

Tom,
All are done at the same time. Maybe it's a bug in 9.2.0.4; I am working with Oracle Support to get a patch, if any is available.

What percentage of estimate can I go for without affecting performance? Is it 50%? 75%? What is the number?

Tom Kyte
December 13, 2004 - 3:56 pm UTC

Why can't you run the query for us and show us?

just verify the last analyzed are all the same.

clarification

Raj, December 13, 2004 - 4:04 pm UTC

OK, here is the output.

SELECT index_name,num_rows, sample_size,last_analyzed FROM DBA_INDEXES
WHERE table_name='ACX_MATCH'

IDX1_ACX_MATCH_SXLFNAME_1,174709765,1538445,12/10/2004 2:06:08 PM
IDX1_ACX_MATCH_PPNUMBER1,166264094,2129274,12/10/2004 2:12:35 PM
IDX1_ACX_MATCH_CKT_LONG_RND1,2041023,2500,12/10/2004 2:26:16 PM
IDX1_ACX_MATCH_CKT_LAT_RND1,1554865,2500,12/10/2004 2:39:31 PM


Tom Kyte
December 13, 2004 - 6:58 pm UTC

Ok, i've managed to partially reproduce *sporadically* 

ops$tkyte@ORA9IR2> create table big_table
  2  as
  3  select *
  4    from big_table.big_table;
 
Table created.
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> alter table big_table add constraint big_table_pk primary key(id);
 
Table altered.
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> create index t_idx1 on big_table(owner);
 
Index created.
 
ops$tkyte@ORA9IR2> exec dbms_stats.gather_index_stats( user, 'T_IDX1', estimate_percent => 10 );
 
PL/SQL procedure successfully completed.
 
ops$tkyte@ORA9IR2> create index t_idx2 on big_table(object_name);
 
Index created.
 
ops$tkyte@ORA9IR2> exec dbms_stats.gather_index_stats( user, 'T_IDX2', estimate_percent => 10 );
 
PL/SQL procedure successfully completed.
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> select index_name, num_rows, sample_size , last_analyzed, sysdate
  2    from user_indexes
  3   where table_name = 'BIG_TABLE';
 
INDEX_NAME                       NUM_ROWS SAMPLE_SIZE LAST_ANALYZED
------------------------------ ---------- ----------- --------------------
SYSDATE
--------------------
BIG_TABLE_PK
13-dec-2004 18:49:12
 
T_IDX1                             999378      496555 13-dec-2004 18:48:44
13-dec-2004 18:49:12
 
T_IDX2                             985001      230640 13-dec-2004 18:49:12
13-dec-2004 18:49:12
 
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> begin
  2  dbms_stats.gather_table_stats
  3  (ownname => user,
  4  tabname => 'BIG_TABLE',
  5  partname=>NULL,
  6  estimate_percent=>NULL,
  7  block_sample => FALSE,
  8  method_opt => 'FOR ALL COLUMNS SIZE 1',
  9  degree => 2,
 10  granularity=>'ALL',
 11  cascade =>TRUE,
 12  stattab=>NULL,
 13  statid=>NULL,
 14  statown =>NULL,
 15  no_invalidate=>FALSE);
 16  end;
 17  /
 
PL/SQL procedure successfully completed.
 
ops$tkyte@ORA9IR2> select index_name, num_rows, sample_size , last_analyzed
  2    from user_indexes
  3   where table_name = 'BIG_TABLE';
 
INDEX_NAME                       NUM_ROWS SAMPLE_SIZE LAST_ANALYZED
------------------------------ ---------- ----------- --------------------
BIG_TABLE_PK                      1000000     1000000 13-dec-2004 18:51:43
T_IDX1                            1000000     1000000 13-dec-2004 18:51:46
T_IDX2                             985001      230640 13-dec-2004 18:49:12
 
ops$tkyte@ORA9IR2>


<b>see how it skipped the last one -- didn't analyze it</b>

but if I pause before the big analyze, it doesn't do that. (it took me many many runs to get that to happen for some reason -- seems timing related, hard to reproduce time after time)

What is the last analyzed time of the table here?

we are getting closer -- can you tell me precisely what happens before this analyze? 

Got it ...

Gabe, December 13, 2004 - 4:41 pm UTC

I missed that block_sample is defaulted to FALSE.

I would be curious to know how the sample [row] table scan works ... is it in fact scanning (information about) all rows?

Or is it doing (conceptually :) something like ...

select rid from
(select rowid rid,mod(rownum,100/<sample_size>) r from t)
where r=0;

On a different note ... what is SAMPLE (X,Y) doing? ... couldn't find (or didn't know how to search for) info in the documentation.


Tom Kyte
December 13, 2004 - 5:42 pm UTC

it is conceptually like

select *
from (
select a.*, trunc(dbms_random.value( 0, 1 ) * 100 ) rnd
from a
)
where rnd <= :pct;




it is sample(N) (they had 10.00000000000000)

http://docs.oracle.com/docs/cd/B10501_01/server.920/a96540/statements_103a.htm#2065954

clarification

Raj, December 13, 2004 - 7:41 pm UTC

I have 2 servers: one is Oracle 9.2.0.5 on Solaris and the other is 9.2.0.4 on Linux.
The following is the result from 9.2.0.5; it looks fine.
I faced the same issue when I was on 9.2.0.4 (or 9.2.0.3, I don't remember correctly).
IDX2_EXP_QP_DATA1_HPROPIND, 9769,9769, 11/16/2004 2:10:25 PM
IDX2_EXP_QP_DATA1_LONG_LAT_RND, 106071980,106071980, 11/16/2004 2:10:25 PM
IDX2_EXP_QP_DATA1_MARTIAL_1, 8520,8520, 11/16/2004 2:10:25 PM
IDX2_EXP_QP_DATA1_MOSAICGRP, 22040,22040, 11/16/2004 2:10:25 PM
IDX2_EXP_QP_DATA1_MSA, 4359,4359, 11/16/2004 2:10:25 PM
IDX2_EXP_QP_DATA1_PC, 4862,4862, 11/16/2004 2:10:25 PM
IDX2_EXP_QP_DATA1_PE1, 16355,16355, 11/16/2004 2:10:25 PM
IDX2_EXP_QP_DATA1_ZIP, 106071980,106071980, 11/16/2004 2:10:25 PM
IDX2_EXP_QP_DATA1_ZIP_PLUS4, 98901,98901, 11/16/2004 2:10:25 PM

The following is the result from 9.2.0.4 on Linux (I did a table compression here):
IDX2_ER_QP_DATA1_HPROPIND, 2106,2106,12/5/2004 11:30:51 PM
IDX2_ER_QP_DATA1_LONG_LAT_RND, 103802668,1676791,12/5/2004 11:31:17 PM
IDX2_ER_QP_DATA1_MARTIAL_1, 1835,1835,12/5/2004 11:33:24 PM
IDX2_ER_QP_DATA1_MOSAICGRP, 4819,4819,12/5/2004 11:35:50 PM
IDX2_ER_QP_DATA1_MSA, 939,939,12/5/2004 11:37:19 PM
IDX2_ER_QP_DATA1_PC, 1040,1040,12/5/2004 11:39:14 PM
IDX2_ER_QP_DATA1_PE1, 3455,3455,12/5/2004 11:41:33 PM
IDX2_ER_QP_DATA1_ZIP, 111971777,2314528,12/5/2004 11:50:01 PM
IDX2_ER_QP_DATA1_ZIP_PLUS4, 22131,2430,12/5/2004 11:50:27 PM
In both places I created the indexes and immediately executed dbms_stats.

I think I can reproduce this any number of times, as I am seeing the same result across most of the tables when the column length is large and num_rows is high.



Tom Kyte
December 13, 2004 - 7:59 pm UTC

I concur on the width of the table and its size -- but I still believe it to be timing related.


you are showing indexes, but not the tables?

sample (x,y)

A reader, December 13, 2004 - 7:44 pm UTC

The SQL reference manual is the first place I looked ... it explains what N in SAMPLE (N) is.

What I found is that both of these work ...

select * from t sample (1.2); <== 1.2% sample ... no problems here
select * from t sample (1,2); <== not sure what this is

What I found mentioned in the documentation is SAMPLE BLOCK(x,y)

http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96533/build_db.htm#15788

Couldn't find a proper definition of this.


Tom Kyte
December 14, 2004 - 8:48 am UTC

It wasn't supposed to be published -- I'll file the doc bug to have it removed.

It represents the number of contiguous blocks or rows that make up a sample. It was part of the compression project (where we have lots more rows fitting on a single block). It increases performance and decreases randomness.

In any case, not supported.

clarification

Raj, December 13, 2004 - 10:53 pm UTC

Table 1 from the above,
CREATE TABLE EXP_QP_DATA1
(
STATE_ABBR VARCHAR2(2 BYTE),
COUNTY VARCHAR2(3 BYTE),
COUNTY_FIPS VARCHAR2(5 BYTE),
MSA NUMBER(6),
CITY_NAME VARCHAR2(28 BYTE),
ZIP VARCHAR2(5 BYTE),
ZIP_PLUS4 VARCHAR2(4 BYTE),
CHILDREN_AGE_0_3 VARCHAR2(1 BYTE),
CHILDREN_AGE_4_6 VARCHAR2(1 BYTE),
CHILDREN_AGE_7_9 VARCHAR2(1 BYTE),
CHILDREN_AGE_10_12 VARCHAR2(1 BYTE),
CHILDREN_AGE_13_18 VARCHAR2(1 BYTE),
PERSON_GENDER_1 VARCHAR2(1 BYTE),
CKT_AGE_CODE VARCHAR2(1 BYTE),
PERSON_OCC_1 VARCHAR2(2 BYTE),
CKT_HOMEOWNER VARCHAR2(1 BYTE),
CKT_LOR VARCHAR2(1 BYTE),
DWELLING_TYPE VARCHAR2(1 BYTE),
PERSON_MARTIAL_1 VARCHAR2(1 BYTE),
MAIL_RESPONDER VARCHAR2(1 BYTE),
CKT_HOME_VAL_COMB VARCHAR2(1 BYTE),
PERSON_EDUCATION_1 VARCHAR2(1 BYTE),
PRES_OF_CREDIT_CARD VARCHAR2(1 BYTE),
CKT_CHILDREN VARCHAR2(1 BYTE),
INCOME_EST_NEW VARCHAR2(1 BYTE),
HOME_PROPERTY_IND VARCHAR2(2 BYTE),
WEALTH_RATING VARCHAR2(1 BYTE),
PERSON_ETHNIC_1 VARCHAR2(2 BYTE),
IN_THE_MARKET_MODEL VARCHAR2(1 BYTE),
PERSONAL_COMPUTER VARCHAR2(1 BYTE),
HOME_BUSINESS VARCHAR2(1 BYTE),
MOSAIC_GROUP VARCHAR2(3 BYTE),
CKT_LAT_RND NUMBER,
CKT_LONG_RND NUMBER
)

Table 2 from the above
CREATE TABLE ER_QP_DATA1
(
STATE_ABBR VARCHAR2(2 BYTE),
COUNTY VARCHAR2(3 BYTE),
COUNTY_FIPS VARCHAR2(5 BYTE),
MSA NUMBER(6),
CITY_NAME VARCHAR2(28 BYTE),
ZIP VARCHAR2(5 BYTE),
ZIP_PLUS4 VARCHAR2(4 BYTE),
CHILDREN_AGE_0_3 VARCHAR2(1 BYTE),
CHILDREN_AGE_4_6 VARCHAR2(1 BYTE),
CHILDREN_AGE_7_9 VARCHAR2(1 BYTE),
CHILDREN_AGE_10_12 VARCHAR2(1 BYTE),
CHILDREN_AGE_13_18 VARCHAR2(1 BYTE),
PERSON_GENDER_1 VARCHAR2(1 BYTE),
CKT_AGE_CODE VARCHAR2(1 BYTE),
PERSON_OCC_1 VARCHAR2(2 BYTE),
CKT_HOMEOWNER VARCHAR2(1 BYTE),
CKT_LOR VARCHAR2(1 BYTE),
DWELLING_TYPE VARCHAR2(1 BYTE),
PERSON_MARTIAL_1 VARCHAR2(1 BYTE),
MAIL_RESPONDER VARCHAR2(1 BYTE),
CKT_HOME_VAL_COMB VARCHAR2(1 BYTE),
PERSON_EDUCATION_1 VARCHAR2(1 BYTE),
PRES_OF_CREDIT_CARD VARCHAR2(1 BYTE),
CKT_CHILDREN VARCHAR2(1 BYTE),
INCOME_EST_NEW VARCHAR2(1 BYTE),
HOME_PROPERTY_IND VARCHAR2(2 BYTE),
WEALTH_RATING VARCHAR2(1 BYTE),
PERSON_ETHNIC_1 VARCHAR2(2 BYTE),
IN_THE_MARKET_MODEL VARCHAR2(1 BYTE),
PERSONAL_COMPUTER VARCHAR2(1 BYTE),
HOME_BUSINESS VARCHAR2(1 BYTE),
MOSAIC_GROUP VARCHAR2(3 BYTE),
CKT_LAT_RND NUMBER,
CKT_LONG_RND NUMBER
)

Tom Kyte
December 14, 2004 - 8:55 am UTC

I didn't need the table defs -- I wanted the information you were showing for the indexes: last analyzed, etc...

Row sampling should give you the exact number of rows

Alexander, December 14, 2004 - 10:28 am UTC

Tom,
since row sampling works like this:
"Basically (conceptually), oracle fly's through the table and generates a random
number between 1 and 100 for each row. If random number <= % to estimate, this
row is included in the estimate -- else -- not."

then after row sampling is finished, we know the EXACT number of rows, correct? Then I'd say the estimated number of rows equals the exact one for row sampling.

For instance, when I run
ANALYZE TABLE asdfasdfas ESTIMATE STATISTICS SAMPLE 10 PERCENT,
I always get exact number of rows in NUM_ROWS.

However, in your examples NUM_ROWS is different from exact number of rows and you are using row sampling. How can that happen?

Tom Kyte
December 14, 2004 - 11:05 am UTC

I said *conceptually* -- "so you can understand what is going on, pretend it does this verbatim".

*conceptually*


see above where we talked about the queries:

...
from
"BIG_TABLE"."BIG_TABLE" sample ( 10.0000000000) t


when you set block_sample=>true, we use this:

...
from
"BIG_TABLE"."BIG_TABLE" sample block ( 10.0000000000) t
............


we are using sampling -- it is as if we visit each row -- but we don't really.
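The point can be sketched with a toy version of what the sampling layer hands back (a hypothetical simulation, not Oracle's code): the consumer only sees the kept rows, so the scaled-up NUM_ROWS wobbles around the true count rather than hitting it exactly.

```python
import random

TRUE_ROWS = 1_000_000
PCT = 10

def sampled_estimate(seed: int) -> int:
    # The sampling layer emits only the kept rows; the consumer never
    # learns the true row count and must scale the kept count back up.
    rng = random.Random(seed)
    kept = sum(1 for _ in range(TRUE_ROWS) if rng.uniform(0, 100) <= PCT)
    return kept * 100 // PCT

# Different samples give slightly different NUM_ROWS values.
for seed in (1, 2, 3):
    print(sampled_estimate(seed))
```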

Clarification

Raj, December 14, 2004 - 11:34 am UTC

Tom,
Here are the table details,

OWNER,TABLE_NAME,TABLESPACE_NAME,CLUSTER_NAME,IOT_NAME,PCT_FREE,PCT_USED,INI_TRANS,MAX_TRANS,INITIAL_EXTENT,NEXT_EXTENT,MIN_EXTENTS,MAX_EXTENTS,PCT_INCREASE,FREELISTS,FREELIST_GROUPS,LOGGING,BACKED_UP,NUM_ROWS,BLOCKS,EMPTY_BLOCKS,AVG_SPACE,CHAIN_CNT,AVG_ROW_LEN,AVG_SPACE_FREELIST_BLOCKS,NUM_FREELIST_BLOCKS,DEGREE,INSTANCES,CACHE,TABLE_LOCK,SAMPLE_SIZE,LAST_ANALYZED,PARTITIONED,IOT_TYPE,TEMPORARY,SECONDARY,NESTED,BUFFER_POOL,ROW_MOVEMENT,GLOBAL_STATS,USER_STATS,DURATION,SKIP_CORRUPT,MONITORING,CLUSTER_OWNER,DEPENDENCIES,COMPRESSION
own1,BIG_TBL1,ts1,,,10,,1,255,1048576,,1,2147483645,,,,NO,N,106071980,381456,0,0,0,99,0,0, 1, 1, N,ENABLED,106071980,11/11/2004 11:37:25 AM,NO,,N,N,NO,DEFAULT,DISABLED,YES,NO,,DISABLED,NO,,DISABLED,DISABLED
own2,BIG_TBL2,ts2,,,0,,1,255,1048576,,1,2147483645,,,,NO,N,106071980,161577,0,0,0,99,0,0, 1, 1, N,ENABLED,106071980,12/5/2004 11:17:35 PM,NO,,N,N,NO,DEFAULT,DISABLED,YES,NO,,DISABLED,NO,,DISABLED,ENABLED


Tom Kyte
December 15, 2004 - 12:31 pm UTC

I cannot make heads or tails of that.

What I wanted was: "hey, just the same thing you showed for the indexes -- show us the name, last analyzed, etc. ALONG WITH the table information."

Estimate/SAMPLE is about CPU, not I/O

Vladimir Andreev, December 14, 2004 - 12:05 pm UTC

In general, it might look surprising and counter-intuitive, but "estimate" and "sample rows" help you save on CPU, not on I/O! That is, one might think that using "estimate statistics sample 1 percent" will only read 1 percent of the table, which is not the case:

analyze table big_table compute statistics for table


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.01 0 4 0 0
Execute 1 7.25 7.09 12652 13409 2 0
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2 7.25 7.10 12652 13413 2 0

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 40

Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
db file scattered read 848 0.00 0.12
db file sequential read 116 0.00 0.00
log file sync 1 0.00 0.00
SQL*Net message to client 1 0.00 0.00
SQL*Net message from client 1 7.88 7.88
********************************************************************************

Note how the time spent waiting for I/O is 0.12 seconds, i.e., only 1.7% of the total elapsed time!
When we compare that to a 1%-sample:

analyze table big_table estimate statistics sample 1 percent for table


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.36 0.35 12364 10065 2 0
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2 0.36 0.35 12364 10065 2 0

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 40

Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
db file scattered read 840 0.00 0.12
db file sequential read 98 0.00 0.00
log file sync 1 0.00 0.00
SQL*Net message to client 1 0.00 0.00
SQL*Net message from client 1 33.78 33.78

we see approximately the same number of logical reads, i.e., we haven't saved any I/O. What we have saved, big time, is CPU: 0.36s vs. 7.25s. And, I repeat, the savings do *not* come from reduced LIO.

HTH
Flado

P.S.: To Alexander: "Estimate" would give you the correct number of rows only if dbms_random.value had a perfectly uniform distribution. By using "estimate", you trade resolution for performance: if your sample size was 1 percent and it read N rows, your num_rows will be N*100, i.e., a nice round number no matter what N is, and N will vary slightly -- it will be *around* 1 percent of the rows. So if dbms_random.value (see Tom's 'conceptual' loop above) happens to be less than 1 for 0.9% (or 1.1%) of your rows, you will get num_rows set to 90% (resp. 110%) of the actual row count. That's called 'error magnification', I believe.

Fl.
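That error-magnification point reduces to simple arithmetic (a hypothetical illustration with made-up kept-row counts):

```python
# With a 1% sample of a 1,000,000-row table, any wobble in the number
# of rows actually kept is multiplied by 100 in the final NUM_ROWS.
TRUE_ROWS = 1_000_000
PCT = 1

for kept in (9_000, 10_000, 11_000):   # 0.9%, 1.0%, 1.1% of rows kept
    num_rows = kept * 100 // PCT
    error = num_rows / TRUE_ROWS - 1
    print(kept, num_rows, f"{error:+.0%}")
```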

Tom Kyte
December 15, 2004 - 12:35 pm UTC

depends on whether you use block sampling or row sampling!


big_table@ORA9IR2> alter session set events '10046 trace name context forever, level 12';

Session altered.

big_table@ORA9IR2> exec dbms_stats.gather_table_stats( user, 'BIG_TABLE', estimate_percent=> 10, block_sample=> FALSE);


OVERALL TOTALS FOR ALL RECURSIVE STATEMENTS

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 35 0.03 0.04 0 0 0 0
Execute 127 0.10 0.11 4 99 303 91
Fetch 164 2.62 3.31 14367 15247 0 149
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 326 2.75 3.47 14371 15346 303 240



but set block_sample=> TRUE and


OVERALL TOTALS FOR ALL RECURSIVE STATEMENTS

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 14 0.01 0.00 0 0 0 0
Execute 82 0.06 0.09 0 34 19 15
Fetch 95 2.26 2.21 6816 2277 0 36
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 191 2.33 2.31 6816 2311 19 51



clarification

Raj, December 15, 2004 - 8:08 pm UTC

Tom,
Here are the details; sorry for the delay, I was stuck in some production problems.
Please let me know your thoughts.

Table1(problematic):Linux ,oracle 9.2.0.4

SELECT OWNER,TABLE_NAME,NUM_ROWS,SAMPLE_SIZE,LAST_ANALYZED FROM DBA_TABLES
WHERE table_name='BIG_TBL'

owner1,BIG_TBL,106071980,106071980,12/5/2004 11:17:35 PM

Index:

SELECT OWNER,TABLE_NAME,index_name,NUM_ROWS,SAMPLE_SIZE,LAST_ANALYZED FROM DBA_INDEXES
WHERE table_name='BIG_TBL'

owner1,BIG_TBL,IDX1_BIG_TBL, 103802668,1676791,12/5/2004 11:31:17 PM
owner1,BIG_TBL,IDX2_BIG_TBL, 2307,2307, 12/5/2004 11:45:35 PM
owner1,BIG_TBL,IDX3_BIG_TBL, 968,968, 12/5/2004 11:47:27 PM
owner1,BIG_TBL,IDX4_BIG_TBL, 3774,3774, 12/5/2004 11:49:44 PM
owner1,BIG_TBL,IDX5_BIG_TBL, 111971777,2314528,12/5/2004 11:50:01 PM

Table on Solaris, oracle 9.2.0.5

SELECT OWNER,TABLE_NAME,NUM_ROWS,SAMPLE_SIZE,LAST_ANALYZED FROM DBA_TABLES
WHERE table_name='BIG_TBL1'

OWNER2,BIG_TBL1,106071980,106071980,11/11/2004 11:37:25 AM

Index:

SELECT OWNER,TABLE_NAME,index_name,NUM_ROWS,SAMPLE_SIZE,LAST_ANALYZED FROM DBA_INDEXES
WHERE table_name='BIG_TBL1'

OWNER2,BIG_TBL1,IDX1_BIG_TBL1, 106071980,106071980,11/11/2004 3:49:07 PM
OWNER2,BIG_TBL1,IDX2_BIG_TBL1, 2715,2715, 11/11/2004 12:12:12 PM
OWNER2,BIG_TBL1,IDX3_BIG_TBL1, 1129,1129, 11/11/2004 12:18:41 PM
OWNER2,BIG_TBL1,IDX4_BIG_TBL1, 4334,4334, 11/11/2004 12:26:21 PM
OWNER2,BIG_TBL1,IDX5_BIG_TBL1, 106071980,106071980,11/11/2004 12:36:30 PM


Tom Kyte
December 16, 2004 - 8:00 am UTC

Ok, can you do this

a) delete ALL existing statistics
b) make sure ALL existing statistics are gone
c) then gather schema stats

what I believe might be happening (i cannot confirm this) is that it is for some reason picking up how the indexes were analyzed at one time in the past -- eg: at some point, they were estimated and it is carrying that forward.

this is a *theory*

Attempt at summary of methods for estimate

Vladimir Andreev, December 16, 2004 - 6:25 am UTC

Indeed. With block sample, one risks a further decrease in accuracy, since it now depends on the distribution of rows over blocks in addition to the distribution of column values over rows. But at the same time, block sample provides further performance improvement by reducing LIOs in addition to CPU usage.
In summary, row sample trades accuracy for CPU time, block sample trades even more accuracy for CPU time and I/O.

On another note, this result is only logical, since if one of the methods were inherently superior, the other one would not exist at all :-)

Cheers,
Flado

Tom Kyte
December 16, 2004 - 8:40 am UTC

I think I've said that last paragraph myself once or twice ;)