Helena Markova, March 30, 2001 - 12:33 am UTC
sample size in analyze
Ramesh, November 30, 2001 - 10:43 am UTC
From the Administrative Guide, Chapter 20,
General Management of Schema Objects:
" In either case, if you specify a percentage greater than 50, or a number of rows or index values that is greater than 50% of those in the object, Oracle computes the exact statistics, rather than estimating."
If I specify
Analyze table table_name estimate statistics sample 50 percent;
It analyzes like compute statistics. (I checked sample_size with both compute and with sample 50 percent.)
If I analyze the table with 49 percent, the sample size from user_tables (a row count) is not exactly 49 percent:
09:59:07 kumarr@IDRDEV SQL>analyze table test estimate statistics sample 49 percent;
Table analyzed.
Elapsed: 00:00:00.79
10:04:54 kumarr@IDRDEV SQL> select table_name, sample_size,num_rows from user_tables
10:05:10 2 where table_name='TEST';
TABLE_NAME SAMPLE_SIZE NUM_ROWS
------------------------------ ----------- ----------
TEST 10644 29986
Elapsed: 00:00:00.20
10:05:13 kumarr@IDRDEV SQL> select 10644/29986*100 from dual;
10644/29986*100
---------------
35.4965651
Please clarify
Thanks in advance
November 30, 2001 - 12:58 pm UTC
It is an estimate after all; it is never exact. We scan the first couple of blocks of the table to get an estimate of the number of rows in the table altogether (so that can skew this very much). We then generate random rowids based on the blocks in the table (from the extents) and remove duplicate rowids from this set (further decreasing the number of rows).
If you want to analyze exactly 49%, you'll need to count the rows and analyze based on rows -- not a sample pct.
analyze with sample size
Ramesh, November 30, 2001 - 3:03 pm UTC
Even though I calculated 49% of the rows and used that figure for estimating, the sample size in user_tables is different. Why?
14:59:45 kumarr@IDRDEV SQL> select 29986/100*49 from dual;
29986/100*49
------------
14693.14
Elapsed: 00:00:00.00
15:00:09 kumarr@IDRDEV SQL> analyze table test estimate statistics
15:00:31 2 sample 14694 rows;
Table analyzed.
Elapsed: 00:00:00.77
15:01:20 kumarr@IDRDEV SQL> select sample_size,num_rows from user_tables
15:01:34 2 where table_name='TEST';
SAMPLE_SIZE NUM_ROWS
----------- ----------
10694 29986
November 30, 2001 - 5:34 pm UTC
Ok, I misspoke in the previous response.
Using sample N rows fixes the issue with the scan of the first couple of blocks to get the estimated number of rows -- but you still have the issue that 14,694 rowids were randomly generated, only 10,694 of which were unique.
I doubt you would ever get exactly 49% analyzed for this reason; you would have to over-estimate the number of rows you wanted. It's not exact (it's an estimate).
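As a back-of-the-envelope check (my sketch of the arithmetic, not necessarily how the kernel builds its sample): if n rowids are drawn independently and uniformly, with replacement, from N rows, the expected number of distinct rowids is N * (1 - (1 - 1/N)^n). Plugging in the numbers above:
select round( 29986 * ( 1 - power( 1 - 1/29986, 14694 ) ) ) expected_distinct
  from dual;
That works out to roughly 11,600 distinct rowids from 14,694 draws -- the same ballpark as the 10,694 actually sampled (the real sampler also throws away generated rowids that do not map to actual rows, so it lands lower still).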
dbms_stats.gather_schema_stats: estimate_percent
jan van mourik, November 30, 2001 - 10:19 pm UTC
Tom,
In your answer here:
http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:1411208576534
you use this example:
BEGIN dbms_stats.gather_schema_stats
(user, 99, method_opt =>null, cascade=> true );
end;
/
You set estimate_percent to 99.
Documentation:
Percentage of rows to estimate (NULL means compute): The valid range is [0.000001,100).
Do you have any particular reason to set it to 99 for gather_schema_stats, and not 100 or NULL? Or does a percentage > 49 mean COMPUTE for this proc too?
Thanks, jan
December 01, 2001 - 5:30 am UTC
pct > 49 is the same as 100%.
I used 99 because the value range is not inclusive of 100:
1 BEGIN dbms_stats.gather_schema_stats
2 (user, 100, method_opt =>null, cascade=> true );
3* end;
scott@ORA817DEV.US.ORACLE.COM> /
BEGIN dbms_stats.gather_schema_stats
*
ERROR at line 1:
ORA-20001: Illegal sample percent 100: must be in the range [0.000001,100)
ORA-06512: at "SYS.DBMS_STATS", line 4479
ORA-06512: at "SYS.DBMS_STATS", line 4610
ORA-06512: at "SYS.DBMS_STATS", line 4720
ORA-06512: at "SYS.DBMS_STATS", line 4702
ORA-06512: at line 1
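So if you actually want a compute through this API, pass NULL for estimate_percent, which the documentation quoted above defines as compute. A minimal sketch:
begin
   dbms_stats.gather_schema_stats
   ( ownname          => user,
     estimate_percent => NULL,   -- NULL means compute
     method_opt       => null,
     cascade          => true );
end;
/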
Copy Stats
A reader, October 26, 2002 - 5:54 pm UTC
Tom,
Don't know if this sounds stupid, but this doubt is really bothering me...
In the development environment we have only a fraction of the actual onsite data. It's very hard to simulate performance problems due to lack of data. My question is: can we simulate the onsite data by loading all the data dictionary tables in the development environment with the same statistics as in the onsite database?
Is this feasible? I believe the data dictionary tables are not directly updateable... but can we use dbms_stats or some other package for maintaining the tables with the details we provide?
If this is feasible, would it be a good idea?
Thanks.
October 27, 2002 - 11:13 am UTC
You know -- this week (Nov 1st) I'm giving a presentation on some performance stuff and one of the slides reads exactly:
Don't Develop...
- On an empty database
- Importing stats isn't good enough
- Can you really look at a plan and say that's a good one
- With a single user (quote from asktom below)
Before the stress test, and when we ran application by itself, it worked
fine. That means SQL statements are committed properly. But somehow
when multi-process access the database resource, it broke.
- In a dust free lab
and expect it to work in production!
So, yes, you can import the stats from prod using dbms_stats but I question what this will buy you in the end.
Tell me, can you look at a page of explain plan output and say "yes, that is exactly what we want. When this runs against 1,000,000 rows it will be as good as when it ran against 1,000"
I strongly feel the developers need to develop against live sized data if they want to have any chance of detecting and correcting issues BEFORE rolling out the code.
Can you guide me through this please
A reader, October 27, 2002 - 11:28 am UTC
Tom,
Thanks for the response. Copying stats will never be a 100% substitute for live data. But, in our case getting data is not going to happen and we would have to develop against a fraction of the actual data.
We are plagued by performance problems for big queries on reports. Now this is the last option we have on hand.
1. Can you please let me know what degree of accuracy I would get by copying stats?
2. Also, if I have 5 tables (say), should I copy column by column? I am not familiar with dbms_stats and got confused over the creation of a table which holds the stats. Which functions should I be using, and in what order?
Thanks
October 27, 2002 - 11:45 am UTC
But I ask you to ask yourself -- of what use would you find getting the query plans?
Suppose
o you have the stats
o you run a query
o it runs good against "test"
o you look at the plan
now what? will it run good in production? can you tell? how?
You don't have to "get real data", you just have to "generate real enough data".
You will have your plague forever otherwise. Tuning a query against a fraction of the data is virtually worthless. Knowing the query plan generated with stats that are big but data that is small -- not very useful (in my opinion)
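For instance, one hypothetical way to multiply up whatever representative sample you do have to production volume (all names here are made up; note the copies repeat the same column values, so distinct-value counts will not scale -- perturb the keys if that matters for your queries):
create table orders_big as
select s.*
  from orders_sample s,
       ( select level from dual connect by level <= 1000 );

exec dbms_stats.gather_table_stats( user, 'ORDERS_BIG' );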
You would be getting 100% accuracy in the query plans if you:
o set the init.ora parameters the same on both databases (sort areas, hash areas, sga size, multi-block read count, etc etc etc)
o had the same stats
But I ask you -- so what?
You'll need to read about dbms_stats (fortunately we document these packages!)
http://docs.oracle.com/docs/cd/A87860_01/doc/appdev.817/a76936/dbms_sta.htm#999107
and in particular:
http://docs.oracle.com/docs/cd/A87860_01/doc/appdev.817/a76936/dbms_st2.htm#1010202
you can probably use export schema stats to get the stats for your entire application schema.
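A minimal sketch of that flow (STATS_TAB is just a name I picked for the staging table):
-- on the source database
exec dbms_stats.create_stat_table( ownname => user, stattab => 'STATS_TAB' );
exec dbms_stats.export_schema_stats( ownname => user, stattab => 'STATS_TAB' );

-- move the STATS_TAB table across (exp/imp, a database link, ...), then on the target:
exec dbms_stats.import_schema_stats( ownname => user, stattab => 'STATS_TAB' );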
Copy Stats
kumar, March 17, 2003 - 6:58 am UTC
Tom,
regarding the copy stats query--
I was trying to do exactly the thing you have said is useless: exporting the stats and importing them to my test server. I thought I would get your opinion before starting, so I searched your site and found your reply (to my utter shock). Can you tell me how else I can simulate my production data so that at least I can be sure of the execution plan? Also, can you tell me why we have the export and import schema stats procedures at all?
March 17, 2003 - 9:24 am UTC
you can see the execution plan -- but tell me, can you tell me if it is a good one or not??????
if you can look at a plan and say "that's a darn good plan, the best one in fact" with accuracy over say 80% -- I'd like to talk to you about a job, I might have one for you.
Tell me, of the two, which is better?
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=HINT: RULE
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'T'
2 1 NESTED LOOPS
3 2 TABLE ACCESS (FULL) OF 'T'
4 2 INDEX (RANGE SCAN) OF 'T_IDX' (NON-UNIQUE
Row Source Operation
---------------------------------------------------
HASH JOIN
TABLE ACCESS FULL T
TABLE ACCESS FULL T
http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:6749454952894
so, yes, you can see the plans -- my point is: unless you test the plans against representative data, so what?
Why?
Chuck Jolley, March 17, 2003 - 10:32 am UTC
Why do managers who will strip naked, and who knows what else, for a doctor's physical balk at letting the developers of their database see the data?
It seems to me that if they don't trust the developers they should get new ones.
On the other hand, we are a tiny shop going against data that is public by law, so maybe I'm missing something.
We frequently fix reports where the problem would never show up at all if we did not have access to production or a VERY recent copy.
What rights SHOULD a development team have to realistically sized and distributed data?
Thanks,
Chuck
stats query
kumar, March 18, 2003 - 11:27 pm UTC
Tom,
Thanks for your response. I understand when you say "unless you test the plans against representative data". I am not going to decide which plan is better for any particular query -- I trust the optimizer fully. But can you tell me what factors the cost calculation is based on? I ask because my understanding is that, for a given expected result, a query whose plan has a lower cost is better than another query whose cost is higher. In that case, I could tune by rewriting queries. Please correct me if I am wrong. Also, can you tell me the primary purpose of the dbms_stats export and import procedures?
March 19, 2003 - 6:36 am UTC
The cost of two queries cannot be compared. In theory, the query with the smallest cost would take less time; in practice, this is often not the case.
You might use dbms_stats/export/import to move stats from a staging database to production, instead of analyzing production, after a reload for example.
can you explain please
kumar, March 19, 2003 - 9:33 am UTC
Tom,
"In theory - the query with the smallest cost would take less time, however, in practice this is many times not the case."
Can you explain a bit more on this, please? In that case, the only way to compare queries is to run them against the actual data and test them. Is there no other way to compare the performance of two queries?
Thanks again for the response
March 19, 2003 - 10:58 am UTC
umm, don't know what more to say.
A query with a cost of 1 could take longer than a query with a cost of 100.
The optimizer gets a query.
generates hundreds, maybe thousands, of plans. assigns each a cost.
sorts by cost, keeps the least costly.
Tell me:
big_table@ORA920> set echo on
big_table@ORA920> alter session set optimizer_index_cost_adj = 100;
Session altered.
big_table@ORA920> set autotrace traceonly explain
big_table@ORA920> select count(status) from big_table where owner = 'SCOTT';
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=6 Card=1 Bytes=10)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'BIG_TABLE' (Cost=6 Card=204 Bytes=2040)
3 2 INDEX (RANGE SCAN) OF 'BIG_TABLE_OWNER_IDX' (NON-UNIQUE) (Cost=3 Card=204)
big_table@ORA920> alter session set optimizer_index_cost_adj = 1;
Session altered.
big_table@ORA920> select count(status) from big_table where owner = 'SCOTT';
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=1 Bytes=10)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'BIG_TABLE' (Cost=2 Card=204 Bytes=2040)
3 2 INDEX (RANGE SCAN) OF 'BIG_TABLE_OWNER_IDX' (NON-UNIQUE) (Cost=3 Card=204)
big_table@ORA920> set autotrace off
So, is the second query "faster"???? No, we just fiddled with stuff to make it have a lower cost.
I would say "you gotta run them"
thanks
kumar, March 20, 2003 - 12:12 am UTC
Tom,
Thanks for the nice illustration. That's a typical Tom answer, if I may say so... I strongly feel that one reason Oracle is such a successful product is the knowledge sharing by people like you. Users are never short of knowledge sources.
Keep up the good work.
copying stats
A reader, September 20, 2010 - 9:41 am UTC
Tom,
I have two different PRODUCTION systems running the same third-party software on their own machines. One update query in system #1 is consuming high CPU, so I copied the stats from the good system (#2) to the poorly performing system (#1). There are only two tables involved in the update statement, so I copied stats for those two tables only.
When I run explain plan on the query (without running it) in system #1, the plan cost and the CPU cost match those of system #2. But when the user processes run that update, OEM grid still shows the older, higher-cost / higher-CPU plan. CPU consumption has not changed on the DB server either.
Is that an expected behavior?
There are differences between the two production machines: #1 has less CPU and memory than #2, and accordingly the SGA is smaller on #1 than on #2. Both are at version 10.2.0.4. Load on both is comparable.
The update is being re-written by the vendor to optimize it and the ETA is 3 weeks, and we are in the process of acquiring more CPU and memory for #1. In the interim, I tried the above.
Thanks for your help.
September 20, 2010 - 5:20 pm UTC
... Is that an expected behavior? ...
insufficient information. no version information here at all. In 10g, just having new stats does not invalidate a plan immediately. So, the old plan would continue to be used until it got flushed from the shared pool. explain plan ALWAYS does a hard parse so explain plan would show you a plan based on the new stats immediately - even before the server starts using them itself.
look at the load time of the sql, and verify that the sql statements you are comparing are actually the same (if I had a penny for every time someone compared a query with literals to a query with binds...)
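For example (a sketch of the sort of check I mean; :sql_id is a placeholder for the statement you are watching):
select sql_id, child_number, plan_hash_value, last_load_time, executions
  from v$sql
 where sql_id = :sql_id;
If last_load_time predates the statistics copy, you are still looking at the old cursor.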
Copying Stats
A reader, September 21, 2010 - 9:44 am UTC
Sorry, I should have provided the database version (10.2.0.4) up front with the question and not towards the end :)
Soon after copying the stats, I did flush the shared pool. Even with that, I see this behavior: there is only one plan_hash_value for this sql_id in OEM grid, so I know only one plan is active, but it is the old plan.
Way to know if stats were collected or copied.
A reader, December 01, 2010 - 3:29 am UTC
Hi Tom
Is there a way to know whether the stats on a partition were actually gathered, or were copied from another partition?
Thanks in advance.
---Raj
December 07, 2010 - 8:37 am UTC
there is no flag that I'm aware of; however, you could take advantage of the fact that things like "last analyzed" and "sample size" will exactly match the partition the stats were copied from. So the last-analyzed date would probably indicate a likely copy if it was 'old' and matched another partition (assuming the copied-from partition had not had statistics gathered again...)
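For example (a sketch; the table name is made up), a copied partition tends to stand out because two partitions end up with identical last_analyzed and sample_size values:
select partition_name, last_analyzed, sample_size, num_rows
  from user_tab_partitions
 where table_name = 'MY_PART_TABLE'
 order by last_analyzed, partition_name;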
Dbms_stats Vs Compute Statistics.
Rajeshwaran, Jeyabal, December 07, 2010 - 8:53 am UTC
rajesh@11GR2> select * from v$version;
BANNER
--------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
PL/SQL Release 11.2.0.1.0 - Production
CORE 11.2.0.1.0 Production
TNS for 32-bit Windows: Version 11.2.0.1.0 - Production
NLSRTL Version 11.2.0.1.0 - Production
Elapsed: 00:00:00.00
rajesh@11GR2>
rajesh@11GR2> create table T
2 nologging
3 as
4 select * from all_objects;
Table created.
Elapsed: 00:00:06.07
rajesh@11GR2>
rajesh@11GR2> begin
2 for i in 1..3
3 loop
4 insert /*+ append */ into t
5 select * from t;
6 commit;
7 end loop;
8 end;
9 /
PL/SQL procedure successfully completed.
Elapsed: 00:00:32.64
rajesh@11GR2>
rajesh@11GR2> select count(*) from T;
COUNT(*)
----------
574352
Elapsed: 00:00:02.21
rajesh@11GR2> exec dbms_stats.gather_table_stats(user,'T');
PL/SQL procedure successfully completed.
Elapsed: 00:00:03.06
rajesh@11GR2>
rajesh@11GR2> select avg_row_len, table_name
2 from user_tables
3 where table_name ='T';
AVG_ROW_LEN TABLE_NAME
----------- ------------------------------
97 T
Elapsed: 00:00:00.01
rajesh@11GR2>
rajesh@11GR2> set autotrace traceonly explain statistics;
rajesh@11GR2> select count(*) from t;
Elapsed: 00:00:02.03
Execution Plan
----------------------------------------------------------
Plan hash value: 2966233522
-------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
-------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 2263 (1)| 00:00:28 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS FULL| T | 574K| 2263 (1)| 00:00:28 |
-------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
8187 consistent gets
8183 physical reads
0 redo size
424 bytes sent via SQL*Net to client
416 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
rajesh@11GR2> set autotrace off;
rajesh@11GR2>
rajesh@11GR2> analyze table t compute statistics;
Table analyzed.
Elapsed: 00:00:11.11
rajesh@11GR2>
rajesh@11GR2> select avg_row_len, table_name
2 from user_tables
3 where table_name ='T';
AVG_ROW_LEN TABLE_NAME
----------- ------------------------------
100 T
Elapsed: 00:00:00.00
rajesh@11GR2>
rajesh@11GR2>
rajesh@11GR2> set autotrace traceonly explain statistics;
rajesh@11GR2>
rajesh@11GR2> select count(*) from t;
Elapsed: 00:00:02.51
Execution Plan
----------------------------------------------------------
Plan hash value: 2966233522
-------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
-------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 2263 (1)| 00:00:28 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS FULL| T | 574K| 2263 (1)| 00:00:28 |
-------------------------------------------------------------------
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
8187 consistent gets
8183 physical reads
0 redo size
424 bytes sent via SQL*Net to client
416 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
rajesh@11GR2>
rajesh@11GR2> set autotrace off;
Tom:
The value for AVG_ROW_LEN is 97 using the dbms_stats API but 100 using compute statistics. Will this (AVG_ROW_LEN) have any impact on the plans? I don't see any.
http://download.oracle.com/docs/cd/E11882_01/server.112/e16638/stats.htm#i13546
December 07, 2010 - 11:50 am UTC
using analyze to gather statistics has been deprecated since Oracle8i -- over a decade. The numbers it reports are not what the optimizer is expecting, and the code that computes them is not kept up to date.
DBMS_STATS is the only supportable way to gather statistics.
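If a table has been analyzed with ANALYZE, one way back to supportable numbers is to delete the old statistics and regather (a minimal sketch; the delete is optional since DBMS_STATS overwrites, but it makes the switch explicit):
analyze table t delete statistics;
exec dbms_stats.gather_table_stats( user, 'T' );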
(p.s. you do not have to put really big examples in front of such simple, clear questions as you've been doing. It would have been enough just to say "The value for AVG_ROW_LEN is 97 using the dbms_stats API but 100 using compute statistics. Will this have any impact on the plans? I don't see any." in this case)
Estimate sample
Martin van Donselaar, August 22, 2011 - 7:16 am UTC
Tom,
Do I understand correctly that if you set the sample size to 10%, for example, only the first blocks of the table will be sampled? So, if there are no deletes on the table, this will likely be only 'older' records?
I am having trouble with a query that performs well for older months but started performing badly for data created after July 2011. That data is different from the older data in the tables, because some id columns can now contain more than one id, where before this used to be only one.
Could you please advise how to make sure the sample represents 'current' data?
August 30, 2011 - 2:13 pm UTC
if you sample 10%, we read 100% of the table but only consider 10% of the data in the estimate.
if you BLOCK sample 10%, we read a random sampling of 10% of the blocks in the table (from all over - not the first 10%)
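For reference, block sampling is requested through the block_sample parameter of the gather procedures, for example (a sketch; it trades some accuracy for speed, especially on data with clustered values):
exec dbms_stats.gather_table_stats( user, 'T', estimate_percent => 10, block_sample => true );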
It is probably not statistics related in this case (that is, not a lack of stats on current data), but rather that the very nature of the data has changed radically. I don't know what it means for a column to contain "more than one id", so I cannot really comment further.
More than one ID
Martin, August 31, 2011 - 8:27 am UTC
By more than one ID, I mean we used to have only one warehouse where an item could be stored, so it was always the same ID. Since July we have 5 warehouses, so each record can have a different ID, depending on the warehouse the item is stored in.
August 31, 2011 - 1:57 pm UTC
you'll have to give an example. I now know what it means to have more than one id (your distinct cardinality for that column has gone up; the column is no longer a constant single value but can vary), but I don't know what your schema/query/plans look like.
It is not due to sampling - we read the entire table and take a 10% random sample - old data and new data both.
Oleksandr Alesinskyy, September 01, 2011 - 7:45 am UTC
"if you sample 10%, we read 100% of the table but only consider 10% of the data in the estimate"
Which 10%? Randomly selected?
September 01, 2011 - 8:23 am UTC
correct, a random 10% sample.
If you gather a 10% sample - review the statistics, gather a 10% sample again - the statistics will likely be different.
ops$tkyte%ORA11GR2> create table t as select * from all_objects;
Table created.
ops$tkyte%ORA11GR2>
ops$tkyte%ORA11GR2> exec dbms_stats.gather_table_stats( user, 'T', estimate_percent=>10 );
PL/SQL procedure successfully completed.
ops$tkyte%ORA11GR2> select num_rows from user_tables where table_name = 'T';
NUM_ROWS
----------
71030
ops$tkyte%ORA11GR2> exec dbms_stats.gather_table_stats( user, 'T', estimate_percent=>10 );
PL/SQL procedure successfully completed.
ops$tkyte%ORA11GR2> select num_rows from user_tables where table_name = 'T';
NUM_ROWS
----------
72820