Hi guys.
I have a problem with the estimation of the cardinality of a skewed column
The distribution of the data is as follows:
select m.m_pricelist_id, count(*)
from m_pricelist_version m
group by m.m_pricelist_id 2 3 ;
M_PRICELIST_ID COUNT(*)
-------------- ----------
1000000 1
1000003 2624686
1000001 1
1000002 33375
1000005 1
102 2
1000004 1
101 2
103 1
I have an index on that column, in addition to histograms.
Select *
From USER_TAB_HISTOGRAMS uh
where uh.TABLE_NAME= 'M_PRICELIST_VERSION'
and uh.COLUMN_NAME='M_PRICELIST_ID';
TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALUE
------------------------------ ------------------------------ --------------- -------------- ------------------------------
M_PRICELIST_VERSION M_PRICELIST_ID 1 101
M_PRICELIST_VERSION M_PRICELIST_ID 65 1000002
M_PRICELIST_VERSION M_PRICELIST_ID 5560 1000003
M_PRICELIST_VERSION M_PRICELIST_ID 5561 1000005
The cardinality is correctly estimated if the filter is entered with literals (2626K)
SQL> select /*+ gather_plan_statistics */ count(*)
from m_pricelist_version m where m.m_pricelist_id = 1000003 2 ;
COUNT(*)
----------
2624745
Transcurrido: 00:00:00.28
SQL> select * from table(dbms_xplan.display_cursor( format=> 'allstats last'));
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID b8fbc0xn66v9n, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*) from m_pricelist_version
m where m.m_pricelist_id = 1000003
Plan hash value: 3128031140
-----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
-----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.28 | 9597 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.28 | 9597 |
|* 2 | INDEX FAST FULL SCAN| M_PRICELIST_VERSION_VALIDFROM | 1 | 2626K| 2624K|00:00:01.30 | 9597 |
-----------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("M"."M_PRICELIST_ID"=1000003)
20 filas seleccionadas.
However, if the filter is pre-calculated from a query:
SQL>select p.m_pricelist_id
from m_pricelist p where p.name='Precios';
M_PRICELIST_ID
--------------
1000003
SQL> select /*+ gather_plan_statistics */ count(*)
from m_pricelist_version m where m.m_pricelist_id =
(select p.m_pricelist_id
from m_pricelist p where p.name='Precios')
2 3 4 5 ;
COUNT(*)
----------
2624745
Transcurrido: 00:00:00.35
SQL> select * from table(dbms_xplan.display_cursor( format=> 'allstats last'));
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID 6z2hy1fb3h0py, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*) from m_pricelist_version
m where m.m_pricelist_id = (select p.m_pricelist_id from m_pricelist p
where p.name='Precios')
Plan hash value: 2632919022
-------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
-------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.35 | 9393 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.35 | 9393 |
|* 2 | INDEX RANGE SCAN | M_PRICELIST_VERSION_VALIDFROM | 1 | 295K| 2624K|00:00:01.47 | 9393 |
| 3 | TABLE ACCESS BY INDEX ROWID| M_PRICELIST | 1 | 1 | 1 |00:00:00.01 | 2 |
|* 4 | INDEX SKIP SCAN | M_PRICELIST_NAME | 1 | 1 | 1 |00:00:00.01 | 1 |
-------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("M"."M_PRICELIST_ID"=)
4 - access("P"."NAME"=U'Precios')
filter("P"."NAME"=U'Precios')
Cardinality differs significantly (295K).
Please any help will be apreciated. Thanks.
The basic issue is that when you join the optimizer doesn't know which value of m_pricelist_id it will be matching on.
For example, I can reproduce findings similar to yours (on 11.2.0.4). Building a table based on dba_objects there are:
- 32,060 objects owned by PUBLIC
- 171 objects owned by CHRIS
If I use these values in the where clause, the estimates are accurate. But if I turn this into a join/subquery then both get 3,029 rows estimated:
create table t1 ( owner not null, object_name not null ) as
select owner, object_name from dba_objects;
create table t2 ( owner not null, io not null )as
select distinct owner, initcap(owner) io from dba_objects;
create index i on t1(owner);
exec dbms_stats.gather_table_stats(user, 't1', method_opt => 'for all columns size skewonly');
exec dbms_stats.gather_table_stats(user, 't2');
select table_name, column_name, histogram from user_tab_col_statistics
where table_name in ('T1', 'T2');
TABLE_NAME COLUMN_NAME HISTOGRAM
T1 OWNER FREQUENCY
T1 OBJECT_NAME HEIGHT BALANCED
T2 OWNER NONE
T2 IO NONE
select /*+ gather_plan_statistics */count(*) from t1
where owner = 'PUBLIC';
COUNT(*)
32,060
select * from table(dbms_xplan.display_cursor(null, null, 'ROWSTATS LAST'));
PLAN_TABLE_OUTPUT
SQL_ID 6kfdkhhvtq4t7, child number 0
-------------------------------------
select /*+ gather_plan_statistics */count(*) from t1 where owner =
'PUBLIC'
Plan hash value: 3309376399
-----------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
-----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 2 | INDEX FAST FULL SCAN| I | 1 | 32056 | 32060 |
-----------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("OWNER"='PUBLIC')
select /*+ gather_plan_statistics */count(*) from t1
where owner = 'CHRIS';
COUNT(*)
171
select * from table(dbms_xplan.display_cursor(null, null, 'ROWSTATS LAST'));
PLAN_TABLE_OUTPUT
SQL_ID 901gsm616gw15, child number 0
-------------------------------------
select /*+ gather_plan_statistics */count(*) from t1 where owner =
'CHRIS'
Plan hash value: 163676535
-------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
-------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 2 | INDEX RANGE SCAN| I | 1 | 163 | 171 |
-------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("OWNER"='CHRIS')
select /*+ gather_plan_statistics */count(*) from t1
where owner in (select t2.owner from t2 where io = 'Public');
COUNT(*)
32,060
select * from table(dbms_xplan.display_cursor(null, null, 'ROWSTATS LAST'));
PLAN_TABLE_OUTPUT
SQL_ID 9f74t08nb6qh8, child number 0
-------------------------------------
select /*+ gather_plan_statistics */count(*) from t1 where owner in
(select t2.owner from t2 where io = 'Public')
Plan hash value: 2896049957
----------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
| 2 | NESTED LOOPS | | 1 | 3029 | 32060 |
| 3 | SORT UNIQUE | | 1 | 1 | 1 |
|* 4 | TABLE ACCESS FULL| T2 | 1 | 1 | 1 |
|* 5 | INDEX RANGE SCAN | I | 1 | 3029 | 32060 |
----------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - filter("IO"='Public')
5 - access("OWNER"="T2"."OWNER")
select /*+ gather_plan_statistics */count(*) from t1
where owner in (select t2.owner from t2 where io = 'Chris');
COUNT(*)
171
select * from table(dbms_xplan.display_cursor(null, null, 'ROWSTATS LAST'));
PLAN_TABLE_OUTPUT
SQL_ID 06220jxs1bf4c, child number 0
-------------------------------------
select /*+ gather_plan_statistics */count(*) from t1 where owner in
(select t2.owner from t2 where io = 'Chris')
Plan hash value: 2896049957
----------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
| 2 | NESTED LOOPS | | 1 | 3029 | 171 |
| 3 | SORT UNIQUE | | 1 | 1 | 1 |
|* 4 | TABLE ACCESS FULL| T2 | 1 | 1 | 1 |
|* 5 | INDEX RANGE SCAN | I | 1 | 3029 | 171 |
----------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - filter("IO"='Chris')
5 - access("OWNER"="T2"."OWNER")
With the subquery the optimizer doesn't know at parse time whether it's comparing owner to PUBLIC or CHRIS.
But before you obsess too much about making the estimates match the cardinality, it's worth asking:
What problem is this actually causing you?
Having estimated rows very different from actual rows is an indicator that the plan
might be "wrong". But it doesn't mean it is!
For example, using the example above Oracle will choose the same plan whether there's 1 row in t1 or billions:
select /*+ gather_plan_statistics cardinality (t1, 1) */count(*) from t1
where owner in (select t2.owner from t2 where io = 'Chris');
select * from table(dbms_xplan.display_cursor(null, null, 'ROWSTATS LAST'));
PLAN_TABLE_OUTPUT
SQL_ID 9kd044nw9vs49, child number 0
-------------------------------------
select /*+ gather_plan_statistics cardinality (t1, 1) */count(*) from
t1 where owner in (select t2.owner from t2 where io = 'Chris')
Plan hash value: 2896049957
----------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
| 2 | NESTED LOOPS | | 1 | 1 | 171 |
| 3 | SORT UNIQUE | | 1 | 1 | 1 |
|* 4 | TABLE ACCESS FULL| T2 | 1 | 1 | 1 |
|* 5 | INDEX RANGE SCAN | I | 1 | 1 | 171 |
----------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - filter("IO"='Chris')
5 - access("OWNER"="T2"."OWNER")
select /*+ gather_plan_statistics cardinality (t1, 1000000000) */count(*) from t1
where owner in (select t2.owner from t2 where io = 'Chris');
select * from table(dbms_xplan.display_cursor(null, null, 'ROWSTATS LAST'));
PLAN_TABLE_OUTPUT
SQL_ID 8kghbrj5gtsm3, child number 0
-------------------------------------
select /*+ gather_plan_statistics cardinality (t1, 1000000000)
*/count(*) from t1 where owner in (select t2.owner from t2 where io =
'Chris')
Plan hash value: 2896049957
----------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
| 2 | NESTED LOOPS | | 1 | 40M| 171 |
| 3 | SORT UNIQUE | | 1 | 1 | 1 |
|* 4 | TABLE ACCESS FULL| T2 | 1 | 1 | 1 |
|* 5 | INDEX RANGE SCAN | I | 1 | 40M| 171 |
----------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - filter("IO"='Chris')
5 - access("OWNER"="T2"."OWNER")
Note: the cardinality hint is undocumented, use with caution!
So the real question you should be asking is:
Is the optimizer choosing a full tablescan when it should use an index (or vice versa)?
If it is causing you performance problems, you could look into using SQL profiles and/or baselines to tweak the plan:
http://www.oracle.com/technetwork/database/bi-datawarehousing/twp-sql-plan-management-11gr2-133099.pdf