Estimate 100 percent VS compute statistics

Tom Kyte

Thanks for the question, Ravi.

Asked: March 29, 2001 - 4:39 pm UTC

Last updated: September 01, 2011 - 8:23 am UTC

Version: 8.1.7

Viewed 10K+ times

You Asked

Dear Tom,

Is there any difference between

ANALYZE TABLE t_name compute statistics; and
ANALYZE TABLE t_name estimate statistics sample 100 percent;

The Oracle manual says that for percentages over 50, Oracle always collects exact statistics. Does that mean that in the above case both are going to take the same time, even for large tables?

I would like to understand:

1) Are the statistics collected in both cases the same?
2) Is the way the data is obtained and analyzed the same in both cases?
3) Is there any advantage of using one over the other?


Thanks
Ravi

and Tom said...

anything above "sample 49" (e.g. sample 50 and above) is the same as

"compute"

it'll do the whole table.


1) yes.
2) yes.
3) no.
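(As a quick sanity check - a sketch, assuming a table named T - you can compare SAMPLE_SIZE to NUM_ROWS in USER_TABLES after each command; anything at "sample 50 percent" or above behaves like a compute, so SAMPLE_SIZE should equal NUM_ROWS:)

analyze table t compute statistics;
select sample_size, num_rows from user_tables where table_name = 'T';

analyze table t estimate statistics sample 60 percent;
select sample_size, num_rows from user_tables where table_name = 'T';

-- both selects should report sample_size = num_rows, i.e. a full compute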

Comments

Helena Markova, March 30, 2001 - 12:33 am UTC


sample size in analyze

Ramesh, November 30, 2001 - 10:43 am UTC

From the Administrator's Guide, Chapter 20, "General Management of Schema Objects":
"In either case, if you specify a percentage greater than 50, or a number of rows or index values that is greater than 50% of those in the object, Oracle computes the exact statistics, rather than estimating."

If I specify
analyze table table_name estimate statistics sample 50 percent;
it analyzes like compute statistics (I checked SAMPLE_SIZE with both compute and with sample 50 percent).
If I analyze the table with 49 percent, the sample size from USER_TABLES is a number of rows that is not exactly 49 percent:

09:59:07 kumarr@IDRDEV SQL>analyze table test estimate statistics sample 49 percent;

Table analyzed.

Elapsed: 00:00:00.79
10:04:54 kumarr@IDRDEV SQL> select table_name, sample_size,num_rows from user_tables            
10:05:10   2  where table_name='TEST';

TABLE_NAME                     SAMPLE_SIZE   NUM_ROWS
------------------------------ ----------- ----------
TEST                                 10644      29986

Elapsed: 00:00:00.20
10:05:13 kumarr@IDRDEV SQL> select 10644/29986*100 from dual;

10644/29986*100
---------------
     35.4965651

Please clarify

Thanks in advance
 

Tom Kyte
November 30, 2001 - 12:58 pm UTC

It is an estimate after all; it is never exact. We scan the first couple of blocks of the table to get an estimate of the number of rows in the table altogether (so that can skew this very much). We then generate random rowids based on the blocks in the table (from the extents) and remove duplicate rowids from this set (further decreasing the number of rows).

If you want to analyze exactly 49%, you'll need to count the rows and analyze based on rows -- not a sample pct.

analyze with sample size

Ramesh, November 30, 2001 - 3:03 pm UTC

Even though I calculated 49% of the rows and used that figure for estimating, the sample size in USER_TABLES is different. Why?

14:59:45 kumarr@IDRDEV SQL> select 29986/100*49 from dual;

29986/100*49
------------
    14693.14

Elapsed: 00:00:00.00
15:00:09 kumarr@IDRDEV SQL> analyze table test estimate statistics
15:00:31   2  sample 14694 rows;

Table analyzed.

Elapsed: 00:00:00.77
15:01:20 kumarr@IDRDEV SQL> select sample_size,num_rows from user_tables
15:01:34   2  where table_name='TEST';

SAMPLE_SIZE   NUM_ROWS
----------- ----------
      10694      29986
 

Tom Kyte
November 30, 2001 - 5:34 pm UTC

Ok, I misspoke in the previous response.

Using "sample N rows" fixes the issue with the sample scan of the first couple of blocks to get the estimated number of rows -- but you still have the issue that 14,694 rowids were randomly generated, only 10,694 of which were unique.

I doubt you would ever get exactly 49% analyzed for this reason; you would have to over-estimate the number of rows you wanted. It's not exact (it's an estimate).
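(A sketch of that over-estimation, reusing the figures from the session above; the 20,000 figure is just an illustrative guess, and the exact target is still not guaranteed since the number of unique rowids varies run to run:)

-- 49% of 29,986 rows is about 14,694; ask for noticeably more,
-- so the unique rowids left after de-duplication land closer to the target
analyze table test estimate statistics sample 20000 rows;
select sample_size, num_rows from user_tables where table_name = 'TEST';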

dbms_stats.gather_schema_stats: estimate_percent

jan van mourik, November 30, 2001 - 10:19 pm UTC

Tom,

In your answer here:
http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:1411208576534
you use this example:

BEGIN dbms_stats.gather_schema_stats
(user, 99, method_opt =>null, cascade=> true );
end;
/

You set estimate_percent to 99.
Documentation:
Percentage of rows to estimate (NULL means compute): The valid range is [0.000001,100).

Do you have any particular reason to set it to 99 for gather_schema_stats, and not 100 or NULL? Or does a percentage > 49 mean COMPUTE for this proc too?

Thanks, jan


Tom Kyte
December 01, 2001 - 5:30 am UTC

pct > 49 is the same as 100%.

I used 99 because the value range is not inclusive of 100:


1 BEGIN dbms_stats.gather_schema_stats
2 (user, 100, method_opt =>null, cascade=> true );
3* end;
scott@ORA817DEV.US.ORACLE.COM> /
BEGIN dbms_stats.gather_schema_stats
*
ERROR at line 1:
ORA-20001: Illegal sample percent 100: must be in the range [0.000001,100)
ORA-06512: at "SYS.DBMS_STATS", line 4479
ORA-06512: at "SYS.DBMS_STATS", line 4610
ORA-06512: at "SYS.DBMS_STATS", line 4720
ORA-06512: at "SYS.DBMS_STATS", line 4702
ORA-06512: at line 1
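
(Per the documentation quoted above, NULL means compute, so a sketch of an explicit compute through this API would be:)

BEGIN
   dbms_stats.gather_schema_stats
   ( ownname          => user,
     estimate_percent => NULL,   -- NULL means compute
     method_opt       => null,
     cascade          => true );
END;
/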



Copy Stats

A reader, October 26, 2002 - 5:54 pm UTC

Tom,
Don't know if this sounds stupid, but this doubt is really bothering me...

In the development environment we have only a fraction of the actual onsite data. It's very hard to simulate performance problems due to lack of data. My question is: can we simulate the onsite data by loading all the data dictionary tables in the development environment with the same statistics as in the onsite database?

Is this feasible? I do believe that data dictionary tables are non-updateable... but can we use dbms_stats or some other package for maintaining the tables with the details we provide?

If this is feasible, would it be a good idea?

Thanks.



Tom Kyte
October 27, 2002 - 11:13 am UTC

You know -- this week (Nov 1st) I'm giving a presentation on some performance stuff and one of the slides reads exactly:

Don't Develop...

- On an empty database
- Importing stats isn’t good enough
- Can you really look at a plan and say “that’s a good one”

- With a single user (quote from asktom below)
“Before the stress test, and when we ran application by itself, it worked
fine. That means SQL statements are committed properly. But somehow
when muti-process access the database resource, it broke.”

- In a dust free lab….

and expect it to work in production!


So, yes, you can import the stats from prod using dbms_stats but I question what this will buy you in the end.

Tell me, can you look at a page of explain plan output and say "yes, that is exactly what we want. When this runs against 1,000,000 rows it will be as good as when it ran against 1,000"


I strongly feel the developers need to develop against live sized data if they want to have any chance of detecting and correcting issues BEFORE rolling out the code.




Can you guide me through this please

A reader, October 27, 2002 - 11:28 am UTC

Tom,

Thanks for the response. Copying stats will never be a 100% substitute for live data. But, in our case getting data is not going to happen and we would have to develop against a fraction of the actual data.

We are plagued by performance problems for big queries on reports. Now this is the last option we have on hand.

1. Can you please let me know what percentage of accuracy I would be getting by copying stats?

2. Also, if I have 5 tables (say), should I copy column by column? I am not familiar with dbms_stats and got confused over the creation of the table which holds the stats. Which functions should I be using, and in what order?

Thanks

Tom Kyte
October 27, 2002 - 11:45 am UTC

But I ask you to ask yourself -- of what use would you find getting the query plans?

Suppose

o you have the stats
o you run a query
o it runs good against "test"
o you look at the plan

now what? will it run good in production? can you tell? how?

You don't have to "get real data", you just have to "generate real enough data".

You will have your plague forever otherwise. Tuning a query against a fraction of the data is virtually worthless. Knowing the query plan generated with stats that are big but data that is small is not very useful (in my opinion).


You would be getting 100% accuracy in the query plans if you:

o set the init.ora parameters the same on both databases (sort areas, hash areas, sga size, multi-block read count, etc etc etc)

o had the same stats


But I ask you -- so what?


You'll need to read about dbms_stats (fortunately we document these packages!)
http://docs.oracle.com/docs/cd/A87860_01/doc/appdev.817/a76936/dbms_sta.htm#999107

and in particular:

http://docs.oracle.com/docs/cd/A87860_01/doc/appdev.817/a76936/dbms_st2.htm#1010202

you can probably use export schema stats to get the stats for your entire application schema.
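
(A minimal sketch of that flow, assuming a statistics table named STATS_TAB - the first two calls run on production, STATS_TAB is then moved across with export/import, and the last call runs on test:)

-- on production
exec dbms_stats.create_stat_table( user, 'STATS_TAB' );
exec dbms_stats.export_schema_stats( user, 'STATS_TAB' );

-- move STATS_TAB to the test database, then on test:
exec dbms_stats.import_schema_stats( user, 'STATS_TAB' );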

Copy Stats

kumar, March 17, 2003 - 6:58 am UTC

Tom,
Regarding the copy stats query --
I was trying to do exactly the thing you have said is useless - exporting the stats and importing them into my test server. I thought of getting your opinion about this before starting, so I searched your site and found your reply (to my utter shock). Can you tell me how else I can simulate my production data so that at least I can be sure of the execution plan? Also, can you tell me why we have the export and import schema stats procedures?

Tom Kyte
March 17, 2003 - 9:24 am UTC

you can see the execution plan -- but tell me, can you tell if it is a good one or not?

if you can look at a plan and say "that's a darn good plan, the best one in fact" with accuracy over, say, 80% -- I'd like to talk to you about a job, I might have one for you.

Tell me, of the two, which is better?

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=HINT: RULE
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'T'
2 1 NESTED LOOPS
3 2 TABLE ACCESS (FULL) OF 'T'
4 2 INDEX (RANGE SCAN) OF 'T_IDX' (NON-UNIQUE


Row Source Operation
---------------------------------------------------
HASH JOIN
TABLE ACCESS FULL T
TABLE ACCESS FULL T

http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:6749454952894

so, yes, you can see the plans -- my point is -- so what? unless you test the plans against representative data -- so?



Why?

Chuck Jolley, March 17, 2003 - 10:32 am UTC

Why do managers who will strip naked, and who knows what else, for a doctor's physical balk at letting the developers of their database see the data?
It seems to me that if they don't trust the developers, they should get new ones.
On the other hand, we are a tiny shop going against data that is public by law, so maybe I'm missing something.
We frequently fix reports where the problem would never show up at all if we did not have access to production or a VERY recent copy.
What rights SHOULD a development team have to realistically sized and distributed data?
Thanks,
Chuck


stats query

kumar, March 18, 2003 - 11:27 pm UTC

Tom,
Thanks for your response. I understand when you say "unless you test the plans against representative data". I am not going to decide which plan is better for any particular query - I trust the optimizer fully. But can you tell me the factors on which the cost calculation is based? I am asking this because my understanding is that, for a particular expected result, the query whose plan has a lower cost is better than another query whose cost is higher. In that case, I can try tuning by rewriting the queries. Please correct me if I am wrong. Also, can you tell me the primary purpose of the dbms_stats export and import procedures?

Tom Kyte
March 19, 2003 - 6:36 am UTC

The cost of two queries cannot be compared. In theory, the query with the smallest cost would take less time; in practice, however, this is many times not the case.

You might use dbms_stats/export/import to move stats from a staging database to production, instead of analyzing production, after a reload for example.
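
(A sketch of that at the table level, assuming a statistics table STATS_TAB already created with dbms_stats.create_stat_table, and a table named T:)

exec dbms_stats.export_table_stats( user, 'T', stattab => 'STATS_TAB', cascade => true );
-- move STATS_TAB to the target database, then:
exec dbms_stats.import_table_stats( user, 'T', stattab => 'STATS_TAB', cascade => true );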

can you explain please

kumar, March 19, 2003 - 9:33 am UTC

Tom,

"In theory - the query with the smallest cost would take less time, however, in practice this is many times not the case."

Can you explain a bit more on this, please? In that case, the only way to compare queries is to run them against the actual data and test them. Is there no other way to compare the performance of two queries?

Thanks again for the response

Tom Kyte
March 19, 2003 - 10:58 am UTC

umm, don't know what more to say.

A query with a cost of 1 could take longer than a query with a cost of 100.

The optimizer gets a query,
generates hundreds, maybe thousands, of plans, assigns each a cost,
sorts by cost, and keeps the least costly.

Tell me:

big_table@ORA920> set echo on
big_table@ORA920> alter session set optimizer_index_cost_adj = 100;

Session altered.

big_table@ORA920> set autotrace traceonly explain
big_table@ORA920> select count(status) from big_table where owner = 'SCOTT';

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=6 Card=1 Bytes=10)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'BIG_TABLE' (Cost=6 Card=204 Bytes=2040)
3 2 INDEX (RANGE SCAN) OF 'BIG_TABLE_OWNER_IDX' (NON-UNIQUE) (Cost=3 Card=204)



big_table@ORA920> alter session set optimizer_index_cost_adj = 1;

Session altered.

big_table@ORA920> select count(status) from big_table where owner = 'SCOTT';

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=1 Bytes=10)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'BIG_TABLE' (Cost=2 Card=204 Bytes=2040)
3 2 INDEX (RANGE SCAN) OF 'BIG_TABLE_OWNER_IDX' (NON-UNIQUE) (Cost=3 Card=204)



big_table@ORA920> set autotrace off


So, is the second query "faster"???? No, we just fiddled with stuff to make it have a lower cost.


I would say "you gotta run them"

thanks

kumar, March 20, 2003 - 12:12 am UTC

Tom,

Thanks for the nice illustration. That's a typical Tom answer, if I can say so. I strongly feel that the reason behind Oracle being a very successful product is the knowledge sharing by people like you. Users are never short of knowledge sources.

Keep up the good work.

copying stats

A reader, September 20, 2010 - 9:41 am UTC

Tom,

I have two different PRODUCTION systems running the same third-party software on their own machines. One update query in system #1 is consuming high CPU, so I copied the stats from the good system (#2) to the poorly performing system (#1). There are only two tables involved in the update statement, so I copied stats for those two tables only.

When I run explain plan on the query without running it in system #1, the plan cost and the CPU cost match those of system #2. But when the user processes run that update, OEM grid is still showing the older, higher-cost / higher-CPU plan. CPU consumption has not changed on the DB server either.

Is that an expected behavior?

There are differences between the two production machines: #1 has less CPU and memory than #2, and accordingly the SGA is smaller on #1 than on #2. Both are at version 10.2.0.4. Load on both is comparable.

The update is being rewritten by the vendor to optimize it (the ETA is 3 weeks), and we are in the process of acquiring more CPU and memory for #1. In the interim, I tried the above.

Thanks for your help.
Tom Kyte
September 20, 2010 - 5:20 pm UTC

... Is that an expected behavior? ...

insufficient information. no version information here at all. In 10g, just having new stats does not invalidate a plan immediately. So, the old plan would continue to be used until it got flushed from the shared pool. explain plan ALWAYS does a hard parse so explain plan would show you a plan based on the new stats immediately - even before the server starts using them itself.

look at the load time of the sql, verify that the sql statements you are comparing are actually the same (if I had a penny for every time someone compared a query with literals to a query with binds...)
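
(If waiting for the old cursor to age out is not acceptable, a sketch: the no_invalidate parameter on the dbms_stats procedures, when set to FALSE, invalidates dependent cursors immediately instead of letting the 10g auto-invalidation window decide. The table name T is hypothetical:)

exec dbms_stats.gather_table_stats( user, 'T', no_invalidate => FALSE );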


Copying Stats

A reader, September 21, 2010 - 9:44 am UTC

Sorry, I should have provided the database version (10.2.0.4) up front with the question and not towards the end :)

Soon after copying the stats, I did flush the shared pool. Even with that, I see this behavior. I see only one plan_hash_value for this sql_id in OEM grid, so I know only one plan is active, but it is the old plan.

Way to know if stats were gathered or copied

A reader, December 01, 2010 - 3:29 am UTC

Hi Tom

Is there a way to know whether the stats on a partition were gathered, or copied from another partition?

Thanks in advance.
---Raj
Tom Kyte
December 07, 2010 - 8:37 am UTC

there is no flag that I'm aware of; however, you could take advantage of the fact that things like "last analyzed" and "sample size" will exactly match the partition the stats were copied from. So, the last analyzed date would probably indicate a likely copy if it was 'old' and matched another partition (assuming the copied-from partition had not had statistics gathered again...)
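
(A sketch of that check, assuming a partitioned table named T:)

select partition_name, last_analyzed, sample_size
  from user_tab_partitions
 where table_name = 'T'
 order by last_analyzed;

-- two partitions sharing an identical, older LAST_ANALYZED (and matching
-- sample size) suggest one was copied from the other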

Dbms_stats Vs Compute Statistics.

Rajeshwaran, Jeyabal, December 07, 2010 - 8:53 am UTC

rajesh@11GR2> select * from v$version;

BANNER
--------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
PL/SQL Release 11.2.0.1.0 - Production
CORE    11.2.0.1.0      Production
TNS for 32-bit Windows: Version 11.2.0.1.0 - Production
NLSRTL Version 11.2.0.1.0 - Production

Elapsed: 00:00:00.00
rajesh@11GR2>
rajesh@11GR2> create table T
  2  nologging
  3  as
  4  select * from all_objects;

Table created.

Elapsed: 00:00:06.07
rajesh@11GR2>
rajesh@11GR2> begin
  2     for i in 1..3
  3     loop
  4             insert /*+ append */ into t
  5             select * from t;
  6             commit;
  7     end loop;
  8  end;
  9  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:32.64
rajesh@11GR2>
rajesh@11GR2> select count(*) from T;

  COUNT(*)
----------
    574352

Elapsed: 00:00:02.21
rajesh@11GR2> exec dbms_stats.gather_table_stats(user,'T');

PL/SQL procedure successfully completed.

Elapsed: 00:00:03.06
rajesh@11GR2>
rajesh@11GR2> select avg_row_len, table_name
  2  from user_tables
  3  where table_name ='T';

AVG_ROW_LEN TABLE_NAME
----------- ------------------------------
         97 T

Elapsed: 00:00:00.01
rajesh@11GR2>
rajesh@11GR2> set autotrace traceonly explain statistics;
rajesh@11GR2> select count(*) from t;

Elapsed: 00:00:02.03

Execution Plan
----------------------------------------------------------
Plan hash value: 2966233522

-------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |  2263   (1)| 00:00:28 |
|   1 |  SORT AGGREGATE    |      |     1 |            |          |
|   2 |   TABLE ACCESS FULL| T    |   574K|  2263   (1)| 00:00:28 |
-------------------------------------------------------------------


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       8187  consistent gets
       8183  physical reads
          0  redo size
        424  bytes sent via SQL*Net to client
        416  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

rajesh@11GR2> set autotrace off;
rajesh@11GR2>
rajesh@11GR2> analyze table t compute statistics;

Table analyzed.

Elapsed: 00:00:11.11
rajesh@11GR2>
rajesh@11GR2> select avg_row_len, table_name
  2  from user_tables
  3  where table_name ='T';

AVG_ROW_LEN TABLE_NAME
----------- ------------------------------
        100 T

Elapsed: 00:00:00.00
rajesh@11GR2>
rajesh@11GR2>
rajesh@11GR2> set autotrace traceonly explain statistics;
rajesh@11GR2>
rajesh@11GR2> select count(*) from t;

Elapsed: 00:00:02.51

Execution Plan
----------------------------------------------------------
Plan hash value: 2966233522

-------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |  2263   (1)| 00:00:28 |
|   1 |  SORT AGGREGATE    |      |     1 |            |          |
|   2 |   TABLE ACCESS FULL| T    |   574K|  2263   (1)| 00:00:28 |
-------------------------------------------------------------------


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       8187  consistent gets
       8183  physical reads
          0  redo size
        424  bytes sent via SQL*Net to client
        416  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

rajesh@11GR2>
rajesh@11GR2> set autotrace off;


Tom:

The value for AVG_ROW_LEN is 97 using the dbms_stats API, but when using compute statistics it's 100. Will this (AVG_ROW_LEN) have any impact on the plans? I don't see any.

http://download.oracle.com/docs/cd/E11882_01/server.112/e16638/stats.htm#i13546
Tom Kyte
December 07, 2010 - 11:50 am UTC

using analyze to gather statistics has been deprecated since Oracle8i - over a decade. The numbers reported by it are not what the optimizer is expecting and the code to compute them is not kept up to date.

DBMS_STATS is the only supportable way to gather statistics.


(p.s. you do not have to put really big examples in front of such simple, clear questions as you've been doing. It would have been enough just to say "The value for AVG_ROW_LEN is 97 using the dbms_stats API, but when using compute statistics it's 100. Will this have any impact on the plans? I don't see any." in this case)


Estimate sample

Martin van Donselaar, August 22, 2011 - 7:16 am UTC

Tom,

Do I understand correctly that if you set the sample size to 10%, for example, only the first blocks of the table will be sampled? So, if there are no deletes on the table, this will likely be only 'older' records?

I am having trouble with a query that performs well for older months but started performing badly for data created after July 2011. That data is different compared to older data in the tables because some id columns can now contain more than one id, where before this used to be only one.

Could you please advise how to make sure the sample represents 'current' data?
Tom Kyte
August 30, 2011 - 2:13 pm UTC

if you sample 10%, we read 100% of the table but only consider 10% of the data in the estimate.

if you BLOCK sample 10%, we read a random sampling of 10% of the blocks in the table (from all over - not the first 10%)

It is probably not statistics related in this case (that is, not a lack of stats on current data), but rather that the very nature of the data has changed radically. I don't know what it means for a column to contain "more than one id" - so I cannot really comment further.
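
(For reference, a sketch of both flavors via dbms_stats against a hypothetical table T - block_sample => TRUE is what switches row sampling to block sampling:)

-- row sample: the whole table is read, 10% of the rows are considered
exec dbms_stats.gather_table_stats( user, 'T', estimate_percent => 10 );

-- block sample: a random 10% of the blocks are read
exec dbms_stats.gather_table_stats( user, 'T', estimate_percent => 10, block_sample => TRUE );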

More than one ID

Martin, August 31, 2011 - 8:27 am UTC

By more than one ID I mean we used to have only one warehouse where an item could be stored, so it would always be the same ID. Since July we have 5 warehouses, so each record can have a different ID, depending on the warehouse the item is stored in.

Tom Kyte
August 31, 2011 - 1:57 pm UTC

you'll have to give an example - I now know what it means to have more than one id (your distinct cardinality for that column has gone up; the column is no longer a constant single value but can vary). But I don't know what your schema/queries/plans look like.


It is not due to sampling - we read the entire table and take a 10% random sample - old data and new data both.


Oleksandr Alesinskyy, September 01, 2011 - 7:45 am UTC

"if you sample 10%, we read 100% of the table but only consider 10% of the data in the estimate"

Which 10%? Randomly selected?
Tom Kyte
September 01, 2011 - 8:23 am UTC

correct, a random 10% sample.


If you gather a 10% sample - review the statistics, gather a 10% sample again - the statistics will likely be different.

ops$tkyte%ORA11GR2> create table t as select * from all_objects;

Table created.

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> exec dbms_stats.gather_table_stats( user, 'T', estimate_percent=>10 );

PL/SQL procedure successfully completed.

ops$tkyte%ORA11GR2> select num_rows from user_tables where table_name = 'T';

  NUM_ROWS
----------
     71030

ops$tkyte%ORA11GR2> exec dbms_stats.gather_table_stats( user, 'T', estimate_percent=>10 );

PL/SQL procedure successfully completed.

ops$tkyte%ORA11GR2> select num_rows from user_tables where table_name = 'T';

  NUM_ROWS
----------
     72820