How many buckets to specify for histograms?

Question and Answer

Tom Kyte

Thanks for the question, Bruce.

Asked: July 04, 2002 - 2:57 pm UTC

Last updated: February 04, 2005 - 11:45 am UTC

Version: 8.1.6


You Asked

When analyzing a table to set up histograms, how does Oracle determine the number of buckets? I notice a different number of buckets in user_histograms for a column than the number I specify in the command. For example, I do some analysis and determine that 68 buckets are needed for a reference table with about 23,000 rows. Then I execute this command:

analyze table geo_xref_tb
compute statistics
for columns map_ref size 68

and then looking at the data dictionary, I see this

select table_name,column_name,count(*)
from user_histograms
where table_name = 'GEO_XREF_TB'
group by table_name,column_name
having count(*) > 2

TABLE_NAME COLUMN_NAME COUNT(*)
------------------------------ --------------- ----------
GEO_XREF_TB MAP_REF 64

It seems Oracle has decided that it doesn't need all the 68 that I specified. I have not come across a situation where Oracle has increased the number of buckets beyond what I specify. Most often Oracle and I either have the same number or are +/- 1. Am I counting the number of buckets wrong, or should I chalk it up to "the database knows better than I do"?

Since it knows what it's doing and its numbers are all that count, is it better to always set the size to the maximum of 254 and let Oracle figure out what it needs?

Thank you for your help,
Bruce


and Tom said...

It is a normal, expected outcome -- it did what you asked. What you are seeing is the different outcome you get when you specify a number of buckets that is LESS than the number of distinct values.

Consider:

ops$tkyte@ORA817DEV.US.ORACLE.COM> create table t ( x int );

Table created.

ops$tkyte@ORA817DEV.US.ORACLE.COM>
ops$tkyte@ORA817DEV.US.ORACLE.COM> insert into t select rownum from all_users where rownum <= 10;

10 rows created.

ops$tkyte@ORA817DEV.US.ORACLE.COM> insert into t select 5 from all_objects;

22819 rows created.

ops$tkyte@ORA817DEV.US.ORACLE.COM>
ops$tkyte@ORA817DEV.US.ORACLE.COM>
ops$tkyte@ORA817DEV.US.ORACLE.COM> analyze table t compute statistics for columns x size 10;

Table analyzed.

ops$tkyte@ORA817DEV.US.ORACLE.COM>
ops$tkyte@ORA817DEV.US.ORACLE.COM> select column_name, endpoint_number, endpoint_value
2 from user_tab_histograms
3 where table_name = 'T';

COL ENDPOINT_NUMBER ENDPOINT_VALUE
--- --------------- --------------
X 1 1
X 2 2
X 3 3
X 4 4
X 22824 5
X 22825 6
X 22826 7
X 22827 8
X 22828 9
X 22829 10

10 rows selected.

ops$tkyte@ORA817DEV.US.ORACLE.COM>
ops$tkyte@ORA817DEV.US.ORACLE.COM> analyze table t delete statistics;

Table analyzed.

ops$tkyte@ORA817DEV.US.ORACLE.COM> analyze table t compute statistics for columns x size 8;

Table analyzed.

ops$tkyte@ORA817DEV.US.ORACLE.COM>
ops$tkyte@ORA817DEV.US.ORACLE.COM>
ops$tkyte@ORA817DEV.US.ORACLE.COM> select column_name, endpoint_number, endpoint_value
2 from user_tab_histograms
3 where table_name = 'T';

COL ENDPOINT_NUMBER ENDPOINT_VALUE
--- --------------- --------------
X 7 5
X 8 10
X 0 1

ops$tkyte@ORA817DEV.US.ORACLE.COM>



So, did it do the right thing? Yes, it did. I asked for 8 and appear to have gotten 3, but I really got 8 (like I asked for). It is just that redundant buckets (those that could be inferred) were left out.
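One way to see the compression: the gap between consecutive ENDPOINT_NUMBER values shows how many of the requested buckets each stored row stands in for (in the frequency-style histogram from the first example, the same gap is the row count per value). A sketch against the example above, assuming 8.1.6 or later for the LAG analytic:

select endpoint_value,
       endpoint_number,
       endpoint_number
         - lag(endpoint_number) over (order by endpoint_number) as buckets_covered
  from user_tab_histograms
 where table_name  = 'T'
   and column_name = 'X'
 order by endpoint_number;

Here the row for value 5 would show 7 -- the seven buckets whose endpoints were all 5 and were therefore left out of the dictionary.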

There is a support note, available on Metalink as well, that explains it in detail:

http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:3126073805757

read through that...






Comments

Good discussion

Bruce, July 05, 2002 - 10:18 am UTC


number of default buckets

A reader, October 28, 2002 - 6:45 am UTC

Hi

in the 8.1.7 doc it says that the default number of buckets when we generate histograms is 75

cut & pasted from doc:
Specify the maximum number of buckets in the histogram. The default value is 75, minimum value is 1, and maximum value is 254

But this is wrong, right? I think the number of buckets is 1.

By default when we compute statistics what does it do?

for table or for all columns?
for indexes? for all indexed columns?

what are the default values for the FOR clause? Can't find it anywhere in the documentation

cheers


Tom Kyte
October 28, 2002 - 7:33 am UTC

sigh, try it and see?

ops$tkyte@ORA920.US.ORACLE.COM> drop table t;

Table dropped.

ops$tkyte@ORA920.US.ORACLE.COM> create table t as select owner from all_objects;

Table created.

ops$tkyte@ORA920.US.ORACLE.COM>
ops$tkyte@ORA920.US.ORACLE.COM> analyze table t compute statistics
  2  for table
  3  for all columns;

Table analyzed.

ops$tkyte@ORA920.US.ORACLE.COM>
ops$tkyte@ORA920.US.ORACLE.COM> select count(*)
  2    from user_tab_histograms
  3   where table_name = 'T';

  COUNT(*)
----------
        33

ops$tkyte@ORA920.US.ORACLE.COM>
ops$tkyte@ORA920.US.ORACLE.COM>

ops$tkyte@ORA920.US.ORACLE.COM> analyze table t delete statistics;

Table analyzed.

ops$tkyte@ORA920.US.ORACLE.COM>
ops$tkyte@ORA920.US.ORACLE.COM> analyze table t compute statistics
  2  for table
  3  for all columns size 1;

Table analyzed.

ops$tkyte@ORA920.US.ORACLE.COM>
ops$tkyte@ORA920.US.ORACLE.COM> select count(*)
  2    from user_tab_histograms
  3   where table_name = 'T';

  COUNT(*)
----------
         2

ops$tkyte@ORA920.US.ORACLE.COM>
ops$tkyte@ORA920.US.ORACLE.COM>

Apparently, it is more than 1..... the documentation is accurate on this point.

SIZE clause in ANALYZE

A reader, December 03, 2003 - 3:37 pm UTC

Tom,

1. Can you please clarify what the SIZE clause in ANALYZE means? I read the Oracle documentation and it is not clear. Also, in your first example, you mentioned that you really got 8.
2. If I analyze a non-integer column, how do I get the range of values in each bucket? That is, if I analyze an integer column, the ENDPOINT_VALUE column shows the column values, and the difference in ENDPOINT_NUMBER between the previous and current row gives the number of values in that range. However, when I analyzed and created a histogram on a VARCHAR column, the ENDPOINT_ACTUAL_VALUE was null.
3. Can you elaborate on how to interpret the output from _TAB_HISTOGRAMS?


Inappropriate Index Selection

Sami, August 10, 2004 - 8:45 am UTC

Dear Tom,

Here the optimizer is not choosing the correct index.
All statistics are up to date.
It looks like we missed histograms, or the number of buckets is not good enough.

BUSUK (244,558 records) is more than 50% of the records in the table, but it is still
going for an index scan instead of a FTS.

Many of the queries use this column in their predicate and we see very poor performance.
Could you please advise us how to identify the number of buckets for the given scenario?
Your valuable advice is appreciated as always.


Table Name: PROFILEDUSER

INDEX_NAME          COLUMN_NAME           COLUMN_POSITION
=================   ===================   ===============
PROFILEDUSER_I01    USER_CATEGORY                       1
PROFILEDUSER_I02    BUSINESS_COUNTRY_ID                 1
PROFILEDUSER_IX01   SEARCH_FIRST_NAME                   1
PROFILEDUSER_IX02   SEARCH_LAST_NAME                    1
PROFILEDUSER_IX03   SEARCH_COMPANY_NAME                 1
PROFILEDUSER_IX04   USERID                              1
PROFILEDUSER_IX04   PROFILEDUSERID                      2
PROFILEDUSER_FK10   CAMLEVELID                          1
PROFILEDUSER_PK     PROFILEDUSERID                      1
PROFILEDUSER_AK1    USERID                              1
PROFILEDUSER_IX05   REGION                              1
PROFILEDUSER_IX06   PROCESSED_BY                        1


SQL> select USER_CATEGORY,count(*) from profileduser group by USER_CATEGORY;
 
USER_CATEGORY                         COUNT(*)
---------------------------------------- ----------
BUSCA                                          8420
BUSHK                                         39688
BUSHKHANGSENG                                 12526
BUSSG                                           863
BUSUK                                        244558
BUSUS                                         18657
DOTCOM                                       135216
GIB                                            5188
HIBM                                          12752
                                               4738
                                         ===========
                                         482639
                                         ===========
10 rows selected.

SQL> l
  1  SELECT endpoint_number,endpoint_value,endpoint_actual_value
  2  FROM dba_histograms
  3  WHERE owner='CRMF'
  4  AND          table_name='PROFILEDUSER'
  5* AND column_name='USER_CATEGORY'
SQL> /
 
ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_A
--------------- -------------- ----------
              0     3.4418E+35
              1     3.7533E+35


SQL> select count(*) from profileduser where USER_CATEGORY='BUSSG';
 
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=682 Card=1 Bytes=6)
   1    0   SORT (AGGREGATE)
   2    1     INDEX (RANGE SCAN) OF 'PROFILEDUSER_I01' (NON-UNIQUE) (Cost=682 Card=52745 Bytes=316470)
 
 
 
 
SQL> select count(*) from profileduser where USER_CATEGORY='BUSUK';
 
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=682 Card=1 Bytes=6)
   1    0   SORT (AGGREGATE)
   2    1     INDEX (RANGE SCAN) OF 'PROFILEDUSER_I01' (NON-UNIQUE) (Cost=682 Card=52745 Bytes=316470) 

Tom Kyte
August 10, 2004 - 8:58 am UTC

you told us everything but "how we compute statistics"

first thought would be size 10 or more -- more buckets than distinct values.
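For instance, a frequency histogram with more buckets than the 10 distinct USER_CATEGORY values could be requested like this (a sketch, using the ANALYZE syntax from earlier in this thread; the size of 20 is just an example):

analyze table profileduser
compute statistics
for columns user_category size 20;

With at least as many buckets as distinct values, each value gets its own bucket, and the optimizer can see that BUSUK is more than half the table.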

Using dbms_stats.gather_table_stats/gather_index_stats

A reader, August 10, 2004 - 3:22 pm UTC

Dear Tom,

We compute the statistics using the following procedure (every 14 days).

Is it okay to use gather_table_stats instead of gather_schema_stats?

PROCEDURE ANALYZE_SCHEMA( Schemaname in varchar2 )
as
begin
    -- gather table statistics for every permanent table in the schema,
    -- skipping names like 'DR%' (typically interMedia Text index tables)
    -- and temporary tables
    for t in ( select owner, table_name
                 from dba_tables
                where owner = Schemaname
                  and table_name not like 'DR%'
                  and temporary <> 'Y' )
    loop
        dbms_stats.gather_table_stats( ownname => t.owner,
                                       tabname => t.table_name,
                                       degree  => 4 );

        -- gather statistics for each non-domain index on that table
        for i in ( select owner, index_name
                     from dba_indexes
                    where owner = Schemaname
                      and index_type <> 'DOMAIN'
                      and table_name = t.table_name )
        loop
            dbms_stats.gather_index_stats( ownname => i.owner,
                                           indname => i.index_name );
        end loop;
    end loop;
end;


Tom Kyte
August 10, 2004 - 3:57 pm UTC

sure.
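If the goal is also the histograms discussed above, gather_table_stats can collect them in the same pass. A sketch only -- the method_opt value and the choice of column are assumptions for illustration, not a recommendation:

begin
    dbms_stats.gather_table_stats(
        ownname    => user,
        tabname    => 'PROFILEDUSER',
        method_opt => 'for columns user_category size 20',  -- histogram on the skewed column
        cascade    => true,   -- gather the index statistics in the same call
        degree     => 4 );
end;
/

With cascade => true, the inner gather_index_stats loop in the procedure above would not be needed.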

trying to understand ENDPOINT_NUMBER & ENDPOINT_VALUE

A reader, August 10, 2004 - 3:59 pm UTC

Dear Tom,

Thanks for your continuous help and advice.

Could you please explain how to interpret ENDPOINT_NUMBER and ENDPOINT_VALUE column?

SQL> l
  1  select ENDPOINT_NUMBER,ENDPOINT_VALUE,ENDPOINT_ACTUAL_VALUE
  2  from dba_histograms
  3  where table_name='MY_TAB'
  4* and column_name='COL_NAME'
/

ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALUE
--------------- -------------- -----------------------------
          89443     3.4418E+35 BUSHK 
         109247     3.4418E+35 BUSHKHANGSENG 
         109428     3.4418E+35 BUSSG 
         369305     3.4418E+35 BUSUK
         388535     3.4418E+35 BUSUS 
         617934     3.5469E+35 DOTCOM 
         620589     3.7014E+35 GIB 
         651972     3.7533E+35 HIBM 
            471     3.4418E+35 BUSCA 
 
9 rows selected.

Thanks in advance. 
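(For a value-based histogram like this one, ENDPOINT_NUMBER is a cumulative row count, ENDPOINT_VALUE is a numeric encoding of the column value, and ENDPOINT_ACTUAL_VALUE holds the readable text. A sketch of deriving per-value row counts, assuming the same table and column names as the query above:)

select endpoint_actual_value,
       endpoint_number
         - lag(endpoint_number, 1, 0)
             over (order by endpoint_number) as num_rows
  from dba_histograms
 where table_name  = 'MY_TAB'
   and column_name = 'COL_NAME'
 order by endpoint_number;

Ordered this way, BUSCA would show 471 rows, BUSHK 89443 - 471 = 88972 rows, and so on.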

Index with NULL values

Sami, August 11, 2004 - 9:19 am UTC

Dear Tom,

Could you please give some advice on this issue? Thanks for all your help in the past.

Table has upto date statistics. All the predicate columns are indexed.


SQL>select region,count(*) from profileduser group by region;

    REGION   COUNT(*)
---------- ----------
      4047         87
      4048        523
      4049      50627
      4050      29420
      4051       1991
               574073 <== NULL values count
=====================
           656721
=====================

We have a non-unique index on the REGION column. The query below works fine and does only 396 LIO.
But if I add the REGION predicate to the same query, it does 29054 LIO and 19681 PIO.
I understand that the execution plan changed after adding the REGION predicate.

I know somewhere you mentioned that the COST doesn't matter. 

Without the REGION predicate, COST=3287 but the response time is 18 secs.
With the REGION predicate, COST=2966 but the response time is 1 min 25 secs.

My questions are:
1. Since the REGION column has 574K rows with NULL values, should I drop the index on that column?
(The optimizer is using the REGION index and coming up with the worse plan.)
2. The REGION column has a histogram too.
3. Is there anything I should do so that the optimizer will choose the best plan (less LIO and PIO)?


With NO  REGION predicate
==========================

SQL>l
  1  select count(first_name) from profileduser p,extendedattributes e
  2  where p.profileduserid=e.profileduserid
  3  and p.hsbc_user_category in ('VAL1','VAL2')
  4* and e.lastupdatedate >= sysdate -30
SQL> /
 
              102
 
Elapsed: 00:00:18.41
 
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=3287 Card=1 Bytes=90) 
   1    0   SORT (AGGREGATE)
   2    1     NESTED LOOPS (Cost=3287 Card=1533 Bytes=137970)
   3    2       TABLE ACCESS (BY INDEX ROWID) OF 'EXTENDEDATTRIBUTES' (Cost=221 Card=1533 Bytes=65919) 
   4    3         INDEX (RANGE SCAN) OF 'EXTENDEDATTRIBUTES_IX03' (NON-UNIQUE) (Cost=15 Card=1533) 
   5    2       TABLE ACCESS (BY INDEX ROWID) OF 'PROFILEDUSER' (Cost=2 Card=34158 Bytes=1605426) 
   6    5         INDEX (UNIQUE SCAN) OF 'PROFILEDUSER_PK' (UNIQUE) (Cost=1 Card=34158)
 
 
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        396  consistent gets
          0  physical reads
          0  redo size
        377  bytes sent via SQL*Net to client
        431  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

With REGION predicate
======================

SQL>l
  1  select count(first_name) from profileduser p,extendedattributes e
  2  where p.profileduserid=e.profileduserid
  3  and p.hsbc_user_category in ('VAL1','VAL2')
  4  and e.lastupdatedate >= sysdate -30
  5* and p.region =4049
SQL>  
 
               19
 
Elapsed: 00:01:25.64
 
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=2966 Card=1 Bytes=92) 
   1    0   SORT (AGGREGATE)
   2    1     HASH JOIN (Cost=2966 Card=1533 Bytes=141036)
   3    2       TABLE ACCESS (BY INDEX ROWID) OF 'EXTENDEDATTRIBUTES' (Cost=221 Card=1533 Bytes=65919) 
   4    3         INDEX (RANGE SCAN) OF 'EXTENDEDATTRIBUTES_IX03' (NON-UNIQUE) (Cost=15 Card=1533) 
   5    2       TABLE ACCESS (BY INDEX ROWID) OF 'PROFILEDUSER' (Cost=2741 Card=2634 Bytes=129066)
   6    5         INDEX (RANGE SCAN) OF 'PROFILEDUSER_IX05' (NON-UNIQUE) (Cost=378 Card=2634) 
 
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
      29054  consistent gets
      19681  physical reads
          0  redo size
        376  bytes sent via SQL*Net to client
        431  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          6  sorts (memory)
          0  sorts (disk)
          1  rows processed
 
SQL> 
 

Tom Kyte
August 11, 2004 - 11:39 am UTC

well, since in reality you'll be using bind variables, what happens when you do it with binds?


and how do you gather stats? what is the *exact* command.

poor explain plan because of wrong index

A reader, August 11, 2004 - 2:36 pm UTC

Dear Tom,

In reality we use ref cursors, which don't use bind variables. (We tried using bind variables via sys_context and got some negative performance; we are still on 8i.)

We use dbms_stats.gather_table_stats and gather_index_stats to collect statistics.




Tom Kyte
August 12, 2004 - 7:43 am UTC

use bind variables -- ugh. if you want negative performance -- don't use them.


that is not what I call "the exact command". that is like saying "we used analyze" -- instead of "we used analyze table t compute statistics for table for all indexes for all indexed columns size 200".

One gives no information;
the other gives the exact command.

Histograms and Cardinality

Dave T, February 03, 2005 - 5:22 am UTC

Hi Tom, 

Great stuff.

How can you overcome an incorrect cardinality estimate that forces the optimiser down an index access path?

Example:

I have the following table, which is used to control the batches into a data warehouse:

SELECT *
FROM   META_ETL_BATCHES_TO_LOAD;

  BATCH_ID ETL_BATCH_NO B DATE_LOAD
---------- ------------ - ---------
         1           87 N

I am loading from a table called DS_LEGAL_ENTITY:

SELECT COUNT(*) 
FROM   DS_LEGAL_ENTITY
WHERE  ETL_BATCH_NO = 87;

  COUNT(*)
----------
  14797299

The majority of the loads use an IN query to find the batches to load:

SELECT *
FROM   DS_LEGAL_ENTITY
WHERE  etl_batch_no IN ( SELECT ETL_BATCH_NO
                         FROM   META_ETL_BATCHES_TO_LOAD
                         WHERE  BATCH_LOADED = 'N' );

The plan for this however shows:

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=34034 Card=583615 Bytes=81122485)
   1    0   TABLE ACCESS (BY GLOBAL INDEX ROWID) OF 'DS_LEGAL_ENTITY' (Cost=34023 Card=583615 Bytes=77037180)
   2    1     NESTED LOOPS (Cost=34034 Card=583615 Bytes=81122485)
   3    2       SORT (UNIQUE)
   4    3         TABLE ACCESS (FULL) OF 'META_ETL_BATCHES_TO_LOAD' (Cost=5 Card=1 Bytes=7)
   5    2       INDEX (RANGE SCAN) OF 'DS_LEGAL_ENTITY_IDX_4' (NON-UNIQUE) (Cost=1112 Card=583615)

This suggests that only 583,615 rows will come from the DS_LEGAL_ENTITY table, hence the nested loops and index range scan.

In reality, of course, there are 15 million.

How can I get the plan to reflect reality? I have created a histogram on etl_batch_no in the ds_legal_entity table with 254 buckets. When you query the data like this:

SQL> SELECT *
  2  FROM   DS_LEGAL_ENTITY
  3  WHERE  ETL_BATCH_NO = 87;

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=43352 Card=15174001
          Bytes=2002968132)

   1    0   PARTITION RANGE (ALL)
   2    1     TABLE ACCESS (FULL) OF 'DS_LEGAL_ENTITY' (Cost=43352 Car
          d=15174001 Bytes=2002968132)

The optimiser has a better estimate of the number of rows to return.

How can I get the original query with an IN closer to this estimate?

 

Tom Kyte
February 03, 2005 - 1:39 pm UTC

you would need to two-step this.

for you see:

SELECT *
FROM   DS_LEGAL_ENTITY
WHERE  etl_batch_no IN ( SELECT ETL_BATCH_NO
                         FROM   META_ETL_BATCHES_TO_LOAD
                         WHERE  BATCH_LOADED = 'N' );

is optimized -- and when it is optimized, it does not actually run any part of the query. so it does not know whether that etl_batch_no subquery returns 15 million records or 5 -- it is as if this were a "bind variable" -- it could be any value (really -- it could be).

does that IN always return a single value?

can you two-step it? get the value, use etl_batch_no = VALUE?

even if not a single value, can we two-step it? get the values, use the literals.


(not binding is OK for big statements like this...)

Dave T, February 04, 2005 - 6:50 am UTC

Tom,

Thanks for the reply.

The ETL_BATCH_NO will be one or more values and this will change from week to week as new batches are loaded into an operational data store.

The IN query is used in views to automatically pick up the batches to load. If there were only one batch, then the following query produces a better guess at the cardinality:

SELECT *
FROM   DS_LEGAL_ENTITY
WHERE  etl_batch_no = ( SELECT ETL_BATCH_NO
                        FROM   META_ETL_BATCHES_TO_LOAD
                        WHERE  BATCH_LOADED = 'N' );

In terms of your idea of doing it in two steps, would this mean generating a view dynamically?

Thanks.


Tom Kyte
February 04, 2005 - 11:45 am UTC

no, it means using dynamic sql with literals.

so you would:

select etl_batch_no from META_ETL_BATCHES_TO_LOAD
WHERE BATCH_LOADED = 'N'

fetch things, create a string like 1,2,3,4,5

and then

select * from ds_legal_entity where etl_batch_no in ( || that string || )

for etl processing, no binds is OK and sometimes preferred. the output of that subquery could be almost "anything", so the optimizer doesn't know whether to index range or full scan.
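A minimal sketch of that two-step approach (the anonymous block, the weak ref cursor type, and the 4000-byte list limit are all assumptions for illustration; it relies on native dynamic SQL, available in 8i and later):

declare
    type rc is ref cursor;
    l_cursor rc;
    l_list   varchar2(4000);
begin
    -- step 1: fetch the batch numbers and build a comma-separated list of literals
    for r in ( select etl_batch_no
                 from meta_etl_batches_to_load
                where batch_loaded = 'N' )
    loop
        l_list := l_list || ',' || r.etl_batch_no;
    end loop;
    l_list := ltrim( l_list, ',' );

    -- step 2: open the real query with the literals in place, so the
    -- optimizer sees the actual batch numbers at parse time
    open l_cursor for
        'select * from ds_legal_entity where etl_batch_no in ( ' || l_list || ' )';

    -- ... fetch from l_cursor and process ...
end;
/

This hard parses once per load, which is exactly the trade described above: for a big, infrequent ETL statement, a better plan is worth a hard parse.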