why would statistics drive optimizer to make a poor index choice

Question and Answer

Tom Kyte

Thanks for the question.

Asked: October 24, 2011 - 10:53 am UTC

Last updated: November 15, 2011 - 7:37 am UTC

Version: 9.2.0.8


You Asked

Unfortunately, I cannot provide a working example because this involves a table of 120 million rows in 10 partitions. When specifying an 11 month date range over this data, a partitioned index is used, which causes the query to time out. A hint for a full table scan causes it to finish in 10 minutes. Smaller date ranges go with the full table scan. All the forums I've posted on indicate this is usually a sign that the optimizer is making a bad decision based on the statistics. The statistics are fresh after the last data load. Do you have any kind of methodology to track down why the optimizer would be making a bad choice? We rebuilt the index, recomputed stats, and the optimizer used the index scan (the bad plan) even MORE often, under more circumstances.

and Tom said...

The optimizer makes the wrong choice not because statistics are "bad", but because estimating cardinalities can be *very* difficult.

Consider a situation like this: you have a table that has a person's month of birth. The table also has their zodiac sign.

If you gather statistics - the optimizer will know very well that 1/12th of the people in that table were born in December (assuming a uniform distribution of course, which we will).

If you gather statistics - the optimizer will know very well that 1/12th of the people in that table are Pisces (assuming that same uniform distribution).

Now, assume there are 10,000,000 rows in this table.

where month_of_birth = 'dec' will be estimated to return about 833,333 records.
where zodiac_sign = 'Pisces' will be estimated to return about 833,333 records.

What will "where month_of_birth = 'dec' and zodiac_sign = 'Pisces'" be estimated at?

Well, the optimizer knows A LOT about month_of_birth, and it knows A LOT about zodiac_sign. It knows (in 10g and before, and in 11g by default) NOTHING about the two together.

So, it will use basic statistics, assume that the two predicates are independent of each other, and estimate that 1/12th * 1/12th of the data - about 70,000 records - will be returned.


Now, if the optimizer guesses 70,000 records and you've partitioned the data by month of birth (for example), that is a lot of the records in that partition. The optimizer is probably going to full scan - not use an index.

But you know and I know that there are really 0 records because all of us Pisces were born in Feb or Mar - not Dec. So, we get the wrong plan.
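
To make that concrete, here is a minimal sketch of the effect (the table name, data, and values are hypothetical - two perfectly correlated columns, with the optimizer multiplying their selectivities as if they were independent):

create table people
as
select 'm'||mod(rownum,12) month_of_birth,
       'z'||mod(rownum,12) zodiac_sign,
       rpad('x',100,'x')   data
  from dual
connect by level <= 1200000;

exec dbms_stats.gather_table_stats( user, 'PEOPLE' );

explain plan for
select *
  from people
 where month_of_birth = 'm1'
   and zodiac_sign = 'z2';

select * from table(dbms_xplan.display);

The Rows estimate will be about 1,200,000 * 1/12 * 1/12 = 8,333 - but the real answer is zero, since no 'm1' row ever has sign 'z2'.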



At a high level - that is what is going on here. There is an incorrect cardinality estimate happening.

In current releases, we have tons of ways to diagnose this and correct for it. The SQL Monitor in 11g displays the estimated versus actual cardinalities for a query - allowing you to quickly see "where did it go wrong". And then you have many tools to attempt to correct it - from SQL Profiles to extended statistics (in 11g, you could gather stats on month_of_birth AND zodiac_sign *together*; we'd know that there are no such records and estimate the cardinality to be 1 - not 70,000).
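
For example, a sketch of gathering those extended statistics in 11g, using the hypothetical people table from above:

select dbms_stats.create_extended_stats( user, 'PEOPLE', '(month_of_birth,zodiac_sign)' )
  from dual;

exec dbms_stats.gather_table_stats( user, 'PEOPLE' );

Once the column group has statistics, the optimizer knows the two columns move together and stops multiplying their selectivities blindly.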


In 9i, you have to do a lot of the work yourself and your toolset for fixing it is much smaller.


What you'd need to do is get the plan with the estimated cardinalities. Get it from v$sql, not from explain plan - dbms_xplan can be used to extract it from v$sql (if you have access to my book Effective Oracle by Design, I show exactly how to do this; it is pretty easy). Then, you would either need to

a) use your knowledge of what the REAL cardinalities are
b) get a tkprof that shows you the real cardinalities from executing the query

to figure out where the estimate is going bad. And then - apply some corrective action to 'fix' it.
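
In 9i, a sketch of that first step - reading the estimated cardinalities straight out of v$sql_plan (the address and hash_value identifying your cursor come from v$sql):

select lpad(' ', 2*depth) || operation || ' ' || options || ' ' || object_name plan_step,
       cardinality estimated_rows
  from v$sql_plan
 where address = :address
   and hash_value = :hash_value
 order by id;

Compare those estimated_rows against the actual row counts in the tkprof to see which step diverges.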

Perhaps that means getting histograms on that column.

Perhaps that means getting local partition statistics instead of global.

Perhaps that means collecting global statistics instead of letting us roll them up (prior to 11g, the number of distinct values in the default global statistics could be way off - which would throw off cardinality estimates in a big way)

Perhaps that means using dynamic sampling
http://www.oracle.com/technetwork/issue-archive/2009/09-jan/o19asktom-086775.html
(works best if you don't use binds - since this is a long-running query, as yours is, it would be OK not to bind in this case)
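
Hedged sketches of those corrective actions ('T', 'PART1', and the column names are placeholders for your own objects):

-- a histogram on the skewed column
exec dbms_stats.gather_table_stats( user, 'T', method_opt => 'for columns month_of_birth size 254' );

-- local statistics for a single partition
exec dbms_stats.gather_table_stats( user, 'T', partname => 'PART1', granularity => 'PARTITION' );

-- global statistics gathered directly, not rolled up from the partitions
exec dbms_stats.gather_table_stats( user, 'T', granularity => 'GLOBAL' );

-- dynamic sampling applied to a single query
select /*+ dynamic_sampling(t 4) */ count(*)
  from t
 where month_of_birth = 'dec' and zodiac_sign = 'Pisces';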


But it will always go back to "estimated versus actual" cardinalities when tracking these things down.
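
In 10g and later, the gather_plan_statistics hint makes that comparison easy - a sketch, with the query itself just a stand-in:

select /*+ gather_plan_statistics */ count(*) from t where x = :n;

select * from table( dbms_xplan.display_cursor( format => 'ALLSTATS LAST' ) );

The E-Rows (estimated) and A-Rows (actual) columns sit side by side, showing exactly which step's estimate went wrong.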


Comments

Is this what is meant by local partition statistics?

A reader, October 25, 2011 - 11:36 am UTC

exec dbms_stats.gather_table_stats( user, 'T', 'PART1', granularity=>'PARTITION');

Tom Kyte
October 25, 2011 - 12:14 pm UTC

that is what is meant by "to gather local partition statistics", yes.


Explain Plan and TKPROF diverge unbelievably

A reader, October 28, 2011 - 10:42 am UTC

Even in my good plan and fast running query, as well as my slower one, I'm seeing a huge divergence between the explain plan "rows" column and the tkprof "rows" column.
Here's a small snippet from the "bad" plan.

Explain plan -
PLAN_TABLE_OUTPUT

-------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost | Pstart| Pstop |
-------------------------------------------------------------------------------------------------------------------------------
| 21 | PARTITION RANGE ALL | | | | | | 1 | 14 |
|* 22 | INDEX RANGE SCAN | WAR_TIME_RAD_SUM_I1 | 78 | | | 28 | 1 | 14 |
|* 23 | SORT JOIN | | 58 | 522 | | 7 | | |
|* 24 | TABLE ACCESS FULL | ENTITY_BU | 58 | 522 | | 4 | | |

And here is from the tkprof -
16126695 PARTITION RANGE ALL PARTITION: 1 14
16126695 INDEX RANGE SCAN WAR_TIME_RAD_SUM_I1 PARTITION: 1 14 (object id 146378)
1858311 SORT JOIN
86 TABLE ACCESS FULL ENTITY_BU


Stats are collected with -
EXECUTE DBMS_STATS.GATHER_SCHEMA_STATS(OWNNAME => '${OWNER}', degree => ${DEGREE}, -
ESTIMATE_PERCENT => 100, -
CASCADE => TRUE);

Degree is always 8.

The stats are current. Any idea where to sniff around for this one?
Tom Kyte
October 28, 2011 - 11:32 am UTC

not without a lot more information - like the predicate that is being evaluated at that point in the plan.

Please try to make things readable as well - yes, I have a fixed width rule here - things will wrap, try to present the information in a format we can read... 80 columns wide - just like a punch card :)

how to get the predicate

A reader, October 29, 2011 - 1:34 pm UTC

This statement is a select involving a view which involves the "large" partitioned table. Is there any way to get the statement the optimizer converted that into? I believe the view was merged, because there is no "VIEW" step in the plan. Or can I make a best guess as to which part of the query translates to that part of the plan?
Tom Kyte
October 31, 2011 - 10:56 am UTC

just use dbms_xplan, it shows you the predicate applied at each and every step of the plan.


ops$tkyte%ORA11GR2> select * from dual where dummy = 'Z';

no rows selected

ops$tkyte%ORA11GR2> select * from table(dbms_xplan.display_cursor);

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------
SQL_ID  ghusfvfh0ny5c, child number 0
-------------------------------------
select * from dual where dummy = 'Z'

Plan hash value: 272002086

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |       |       |     2 (100)|          |
|*  1 |  TABLE ACCESS FULL| DUAL |     1 |     2 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("DUMMY"='Z')


18 rows selected.


there are a LOT of ways to use dbms_xplan - you can use it to read the plan right out of v$sql by sql_id. Read about it in the supplied PL/SQL Packages and Types Reference if you are not familiar with it.
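
For example, using the sql_id from the output above:

select * from table( dbms_xplan.display_cursor( sql_id => 'ghusfvfh0ny5c', cursor_child_no => 0 ) );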

v$sql, dbms_xplan, & tkprof

A reader, November 10, 2011 - 11:50 am UTC

Tom - up until now I've gone on the assumption that the explain plan is what the optimizer thinks is going to happen, and the tkprof is what actually did happen. True or false? Are you suggesting a better method of comparison is the normal explain plan vs. the v$sql method, e.g., SELECT * FROM table(DBMS_XPLAN.DISPLAY_CURSOR('gwp663cqh5qbf',0)); ?

Tom Kyte
November 10, 2011 - 1:46 pm UTC

that the explain plan is what the optimizer thinks is going to happen, and the tkprof is what actually did happen. True or false?


I would add one word in there: "thinks MIGHT happen"

explain plan doesn't bind peek - the real optimizer does

explain plan hard parses right now, in this environment, as things look at this instant, reality (the query executing an hour ago) might have been different.

explain plan assumes all binds are VARCHAR2, so if you do something like "where indexed_column_that_is_a_string = :bind_variable_that_is_a_number", explain plan will see "string = string", when in fact the optimizer will see "to_number(string) = number" - we'll have applied a function to the database column, obviating the index access path in real life.

and so on.


Always use the v$ tables if you want "what really happened", or "what will really happen"

what about tkprof

A reader, November 11, 2011 - 10:27 am UTC

You mentioned how explain plan has some limitations but you mentioned nothing of tkprof. Am I better off looking at v$sql_plan than tkprof?
Tom Kyte
November 11, 2011 - 11:00 am UTC

tkprof has two possible sections

explain plan - generated at the time the report is run using explain plan, it is unreliable.

row source operation - generated at run time, 100% true, it is what was actually used.

Here is an example:

ops$tkyte%ORA11GR2> create table t ( x varchar2(20) primary key, data varchar2(2000) );

Table created.

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> variable n number
ops$tkyte%ORA11GR2> exec dbms_monitor.session_trace_enable;

PL/SQL procedure successfully completed.

ops$tkyte%ORA11GR2> select * from t where x = :n;

no rows selected

And here is the tkprof report for that statement:

select *
from
 t where x = :n


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      0.00       0.00          0          0          0           0

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 374  (OPS$TKYTE)
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         0          0          0  TABLE ACCESS FULL T (cr=0 pr=0 pw=0 time=6 us cost=2 size=1014 card=1)


Rows     Execution Plan
-------  ---------------------------------------------------
      0  SELECT STATEMENT   MODE: ALL_ROWS
      0   TABLE ACCESS (BY INDEX ROWID) OF 'T' (TABLE)
      0    INDEX (UNIQUE SCAN) OF 'SYS_C0025570' (INDEX (UNIQUE))






See the row source operation - that is what ACTUALLY happened, see the execution plan section (generated because I tkprofed with explain=u/p), that is NOT what happened, that is what explain plan thought would happen, because explain plan thought :n was a varchar2, not a number...


never use explain=u/p with tkprof and you'll be OK - when you see row source operation in there, you know that is the plan that was used.
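
A sketch of the two invocations (trace and report file names are hypothetical):

$ tkprof ora11gr2_ora_12345.trc tk.prf sys=no
$ tkprof ora11gr2_ora_12345.trc tk.prf sys=no explain=scott/tiger

The first produces only the trustworthy Row Source Operation sections; the second adds the Execution Plan sections that explain plan generates at report time.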

Philippe, November 13, 2011 - 7:22 am UTC

Why doesn't Oracle change "explain plan" to accept real bind variables?
Is this a technical limitation?
Given these limitations, might it be best to discourage its use?
Tom Kyte
November 15, 2011 - 7:37 am UTC

because you simply do not bind to explain plan.

ops$tkyte%ORA11GR2> select * from dual where :n = 1;
SP2-0552: Bind variable "N" not declared.
ops$tkyte%ORA11GR2> explain plan for select * from dual where :n = 1;

Explained.

ops$tkyte%ORA11GR2> 



there is no facility for "binding" to an explain plan - it just takes a query and hard parses it right then, right there - at that point in time.

There would have to be a new facility - but even then with things like bind peeking - it would be "unreliable" for telling you what plan was used for a given query (that after all is what we are MOST often looking for - why did this query run slow - let's look at the plan...)


I would recommend dbms_xplan against the v$ tables so you can see what really happened - not what would happen if the query just happened to be parsed with value 'x'.

If you want to see what would happen if the query was parsed with 'x', explain it with literals (using to_date/to_number/to_timestamp to coerce things to the right type when necessary)
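
For instance, going back to the T table from the tkprof example above, explaining with a number literal reproduces the implicit conversion the runtime optimizer saw:

explain plan for select * from t where x = 42;

select * from table(dbms_xplan.display);

Now explain plan sees to_number(x) = 42 - the same full scan that actually ran - instead of the index unique scan it guessed when it assumed the bind was a varchar2.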
