Skip to Main Content

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, Bernard.

Asked: December 30, 2004 - 11:04 am UTC

Last updated: June 09, 2010 - 8:45 am UTC

Version: 9.2.0.5

Viewed 1000+ times

You Asked

Hello,

I am currently investigating histograms. As such, I use the function called WIDTH_BUCKET for dealing with skewed data.

However, It is only for numeric data.

How to deal with non-muneric data, such as DATE or CHAR/VARCHAR datatypes ?


Thx

and Tom said...

Say you have a script like this:


-------------------------- histo.sql -----------------------------------
column hist format a40

define TNAME=&1
define CNAME=&2
define BUCKETS=&3

select wb,
cnt,
to_char(round( 100*cnt/(max(cnt) over ()),2),'999.00') rat,
rpad( '*', 40*cnt/(max(cnt) over ()), '*' ) hist
from (
select wb,
count(*) cnt
from (
select width_bucket( r, 0, (select count(distinct &cname) from &tname)+1, &buckets) wb
from (
select dense_rank() over (order by &cname) r
from &tname
)
)
group by wb
)
order by wb
/
-------------------------------------------------------------------------


You can use it to visualize the relative spread of data.

for example:

ops$tkyte@ORA9IR2> @histo all_objects owner 10
ops$tkyte@ORA9IR2> set verify off
ops$tkyte@ORA9IR2> set echo off

WB CNT RAT HIST
---------- ---------- ------- ----------------------------------------
1 1 .01
2 265 1.89
3 929 6.62 **
4 238 1.70
5 998 7.11 **
6 93 .66
7 11753 83.75 *********************************
8 14033 100.00 ****************************************
9 131 .93
10 270 1.92

10 rows selected.

shows there is clumping...

ops$tkyte@ORA9IR2> @histo all_objects owner 20
ops$tkyte@ORA9IR2> set verify off
ops$tkyte@ORA9IR2> set echo off

WB CNT RAT HIST
---------- ---------- ------- ----------------------------------------
2 1 .01
3 4 .03
4 261 1.91
5 27 .20
6 902 6.61 **
7 234 1.71
8 4 .03
9 29 .21
10 969 7.10 **
11 7 .05
12 86 .63
13 11738 85.98 **********************************
14 15 .11
15 13652 100.00 ****************************************
16 381 2.79 *
17 2 .01
18 129 .94
19 270 1.98

18 rows selected.

since 18 < 20, we'd have a value per bucket and see it might be worth it -- since most of the data is 2 points

ops$tkyte@ORA9IR2> @histo all_objects created 10
ops$tkyte@ORA9IR2> set verify off
ops$tkyte@ORA9IR2> set echo off

WB CNT RAT HIST
---------- ---------- ------- ----------------------------------------
1 2804 25.49 **********
2 1347 12.25 ****
3 952 8.65 ***
4 957 8.70 ***
5 11000 100.00 ****************************************
6 6473 58.85 ***********************
7 1343 12.21 ****
8 2074 18.85 *******
9 649 5.90 **
10 1112 10.11 ****

10 rows selected.

created has some clumping as well... but:


ops$tkyte@ORA9IR2> @histo all_objects object_id 15
ops$tkyte@ORA9IR2> set verify off
ops$tkyte@ORA9IR2> set echo off

WB CNT RAT HIST
---------- ---------- ------- ----------------------------------------
1 1914 99.95 ***************************************
2 1914 99.95 ***************************************
3 1914 99.95 ***************************************
4 1914 99.95 ***************************************
5 1914 99.95 ***************************************
6 1914 99.95 ***************************************
7 1914 99.95 ***************************************
8 1915 100.00 ****************************************
9 1914 99.95 ***************************************
10 1914 99.95 ***************************************
11 1914 99.95 ***************************************
12 1914 99.95 ***************************************
13 1914 99.95 ***************************************
14 1914 99.95 ***************************************
15 1914 99.95 ***************************************

15 rows selected.

ops$tkyte@ORA9IR2>


object id is nicely distributed (should be -- primary key!)

Rating

  (10 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Bernard H, December 30, 2004 - 2:45 pm UTC

Thx for this great insight.

Congratulations for this tremendous Web Site.

Moreover, it seems to be easy to reach experts (very available). I am really impressed as I was while reading your book "Effective Oracle by Design".

Keep up with this excellent work and Happy New Year 2005.

Anto, February 11, 2005 - 10:05 am UTC

Hi Tom,

Thanks for such a useful script

ops$tkyte@ORA9IR2> @histo all_objects owner 20
ops$tkyte@ORA9IR2> set verify off
ops$tkyte@ORA9IR2> set echo off

WB CNT RAT HIST
---------- ---------- ------- ----------------------------------------
2 1 .01
3 4 .03
4 261 1.91
5 27 .20
6 902 6.61 **
7 234 1.71
8 4 .03
9 29 .21
10 969 7.10 **
11 7 .05
12 86 .63
13 11738 85.98 **********************************
14 15 .11
15 13652 100.00 ****************************************
16 381 2.79 *
17 2 .01
18 129 .94
19 270 1.98

18 rows selected.

since 18 < 20, we'd have a value per bucket and see it might be worth it --
since most of the data is 2 points

a) Can you please explain the above again please ?

b) Can we estimate the optimum bucket size for histograms using this script ?

c) Can we modify this script to get the column value as well ?


thanks
Anto



Tom Kyte
February 12, 2005 - 7:44 am UTC

Ok, starting with a)  Run the queries from the inside out:



ops$tkyte@ORA9IR2> create table t as select * from all_objects;
Table created.
 

ops$tkyte@ORA9IR2> define TNAME=&1
ops$tkyte@ORA9IR2> define CNAME=&2
ops$tkyte@ORA9IR2> define BUCKETS=&3
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> select * from (
  2  select &cname, dense_rank() over (order by &cname) r
  3    from &tname
  4  ) where rownum <= 10
  5  /
 
OWNER                                   R
------------------------------ ----------
B                                       1
BIG_TABLE                               2
BIG_TABLE                               2
BIG_TABLE                               2
CTXSYS                                  3
CTXSYS                                  3
CTXSYS                                  3
CTXSYS                                  3
CTXSYS                                  3
CTXSYS                                  3
 
10 rows selected.

<b>assign a ranking (dense, 1, 2, 3... ) to the strings.  width_buckets for creating histograms -- they need a number.  mapping of strings to numbers in sorted order</b>

 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> select * from (
  2  select &cname,
  3        width_bucket( r, 0, (select count(distinct &cname) from &tname)+1, &buckets) wb
  4   from (
  5  select &cname, dense_rank() over (order by &cname) r
  6    from &tname
  7        )
  8  ) where rownum <= 10
  9  /
 
OWNER                           WB
------------------------------ ---
B                                1
BIG_TABLE                        1
BIG_TABLE                        1
BIG_TABLE                        1
CTXSYS                           2
CTXSYS                           2
CTXSYS                           2
CTXSYS                           2
CTXSYS                           2
CTXSYS                           2
 
10 rows selected.

<b>Now, put them into "buckets".  we've told Oracle "here is a range of numbers - r, now, please put r into a histogram of &BUCKETS buckets, low range of the histogram is 0 and the high non-inclusive bound is count(distinct &cname)+1 -- since the upper bound is non-inclusive, we added 1"</b>


ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> select wb,
  2         count(*) cnt,
  3             min(&cname) lo,
  4             max(&cname) hi
  5    from (
  6  select &cname, width_bucket( r, 0, (select count(distinct &cname) from &tname)+1, &buckets) wb
  7   from (
  8  select &cname, dense_rank() over (order by &cname) r
  9    from &tname
 10        )
 11        )
 12  group by wb
 13  /
 
 WB    CNT LO                             HI
--- ------ ------------------------------ ------------------------------
  1      4 B                              BIG_TABLE
  2    263 CTXSYS                         DW_XPS
  3    333 MDSYS                          OPS$TKYTE
  4    998 ORDPLUGINS                     ORDSYS
  5     93 OUTLN                          PERFSTAT
  6  11743 PUBLIC                         SCOTT
  7  14037 SYS                            SYSTEM
  8      2 U1                             U2
  9    131 USER1                          WMSYS
 10    274 WSMGMT                         XDB
 
10 rows selected.

<b>once we have them in buckets -- grouping them and counting how many fall into the bucket it easy.  we can even get the HI and LO values in the buck as demostrated, the remaing level in the query is just for formatting</b>

<b>so, onto number 2 here:</b>


ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> set echo off
@wb t object_id 10
old   9:           min(&cname) lo,
new   9:           min(object_id) lo,
old  10:           max(&cname) hi
new  10:           max(object_id) hi
old  12: select &cname, width_bucket( r, 0, (select count(distinct &cname) from &tname)+1, &buckets) wb
new  12: select object_id, width_bucket( r, 0, (select count(distinct object_id) from t)+1, 10) wb
old  14: select &cname, dense_rank() over (order by &cname) r
new  14: select object_id, dense_rank() over (order by object_id) r
old  15:   from &tname
new  15:   from t
 
 WB DATA                  CNT RAT     HIST
--- ------------------ ------ ------- ----------------------------------------
  1 3/2849               2787   99.96 ***************************************
  2 2850/5722            2788  100.00 ****************************************
  3 5723/8540            2788  100.00 ****************************************
  4 8541/11328           2788  100.00 ****************************************
  5 11329/14116          2788  100.00 ****************************************
  6 14117/16904          2788  100.00 ****************************************
  7 16905/19692          2788  100.00 ****************************************
  8 19693/22480          2788  100.00 ****************************************
  9 22481/25285          2788  100.00 ****************************************
 10 25286/38285          2787   99.96 ***************************************
 
10 rows selected.
 
@wb t owner 10
old   9:           min(&cname) lo,
new   9:           min(owner) lo,
old  10:           max(&cname) hi
new  10:           max(owner) hi
old  12: select &cname, width_bucket( r, 0, (select count(distinct &cname) from &tname)+1, &buckets) wb
new  12: select owner, width_bucket( r, 0, (select count(distinct owner) from t)+1, 10) wb
old  14: select &cname, dense_rank() over (order by &cname) r
new  14: select owner, dense_rank() over (order by owner) r
old  15:   from &tname
new  15:   from t
 
 WB DATA                  CNT RAT     HIST
--- ------------------ ------ ------- ----------------------------------------
  1 B/BIG_TABLE             4     .03
  2 CTXSYS/DW_XPS         263    1.87
  3 MDSYS/OPS$TKYTE       333    2.37
  4 ORDPLUGINS/ORDSYS     998    7.11 **
  5 OUTLN/PERFSTAT         93     .66
  6 PUBLIC/SCOTT        11743   83.66 *********************************
  7 SYS/SYSTEM          14037  100.00 ****************************************
  8 U1/U2                   2     .01
  9 USER1/WMSYS           131     .93
 10 WSMGMT/XDB            274    1.95
 
10 rows selected.

<b>you are not using this to determine the optimal nubmer, but rather to visualize of the data is skewed...

object id, appears to not be (use a &3 that fills your screen, lets you see any patterns).


owner appears to definitely be. and in fact, if you used 254 for the buckets in my case, it would completely fill out since I only have 20 distinct schemas with objects.  You would see that owner is a candidate for histograms by the pattern.


The number of buckets -- 254 seems good.  If there are less distinct values in the table, you get less.  If there are more, you get as many as you can.  It would be rare to have to play with that number</b>


c) done.


 

Anto, February 14, 2005 - 9:59 am UTC

Hi Tom,

Thanks for the detailed explanation.
I understand the scripts are useful to visualize the skewness of data

I still am not clear how to arrive at the bucket size. Is there any method for that or we have to experiment to get the optimal value?

Here is what I found from Oracle metalink

"The default number of buckets is 75. This value provides an appropriate level of detail for most data distributions. However, since the number of buckets in the histogram, the sampling rate, and the data distribution all affect the usefulness of a histogram, you may need to experiment with different numbers of buckets to obtain the best results.

If the number of frequently occurring distinct values in a column is relatively small, then it is useful to set the number of buckets to be greater than the number of frequently occurring distinct values. "

As per the above statement, the bucket size in this case for Owner column should be around 5 or 10.(based on number of frequently occurring distinct values for the column), how come you say " the bucket size of 254 seems to be good" ?

thanks
Anto

Tom Kyte
February 14, 2005 - 1:26 pm UTC

254, use the biggest one there -- if you have less than 254 discrete values, it'll scale down to a bucket/value. If you have more, you'll get the most you can.

Difference between NTILE and WIDTH_BUCKET

Zahir M, January 22, 2006 - 9:19 pm UTC

Tom ,
Could you please elaborate the difference between NTILE and WIDTH_BUCKET?

Thanks


Tom Kyte
January 23, 2006 - 9:39 am UTC

http://docs.oracle.com/docs/cd/B19306_01/server.102/b14200/functions214.htm#i1001630

<quote src=that url>
Purpose

WIDTH_BUCKET lets you construct equiwidth histograms, in which the histogram range is divided into intervals that have identical size. (Compare this function with NTILE, which creates equiheight histograms.) Ideally each bucket is a closed-open interval of the real number line. For example, a bucket can be assigned to scores between 10.00 and 19.999... to indicate that 10 is included in the interval and 20 is excluded. This is sometimes denoted [10, 20).
</quote>


width bucket -> user supplied range is used...
ntile -> number of observations is the same in each group, regardless of range


ops$tkyte@ORA9IR2> select r,
  2         width_bucket( r, 1, 11, 3 ) wb,
  3         width_bucket( r, 1, 100, 3 ) wb,
  4         ntile(3) over (order by r) nb
  5    from t;

         R         WB         WB         NB
---------- ---------- ---------- ----------
         1          1          1          1
         2          1          1          1
         3          1          1          1
         4          1          1          1
         5          2          1          2
         6          2          1          2
         7          2          1          2
         8          3          1          3
         9          3          1          3
        10          3          1          3

10 rows selected.


See how width bucket very much depends on the range - it takes the user supplied range - splits it into N groups (when range is 1 <= x <100, the ranges would be 0..34, 35..67 and 68..99 - with 100 being an outlier in group 4 for example) 

Ntile

Zahir M, February 03, 2006 - 9:15 pm UTC

Tom , 

Few questions on NTILE 

1) Should NTILE returning 4 since the sal is the same ( 99 ) across all the employees in dept # 30 .

  1   Select deptno , ename , sal , hiredate ,
  2  ntile(4) over( order by sal ) nt ,
  3  width_bucket( sal , 1 , 100 , 4) wb
  4  from emptmp
  5* where deptno = 30
SQL> /

    DEPTNO ENAME             SAL HIREDATE          NT         WB
---------- ---------- ---------- --------- ---------- ----------
        30 ALLEN              99 20-FEB-81          1          4
           WARD               99 22-FEB-81          1          4
           JAMES              99 03-DEC-81          2          4
           BLAKE              99 01-MAY-81          2          4
           TURNER             99 08-SEP-81          3          4
           MARTIN             99 28-SEP-81          4          4


6 rows selected.

SQL> select version from v$instance;

VERSION
-----------------
10.2.0.1.0


2) What is the meaning of equi-width and equi-height histogram ( as mentioned in the manual ) ?

 

Tom Kyte
February 06, 2006 - 12:08 am UTC

no, you've missed the intention of ntile. Ntile is a lot like (row_number/count/N)

its goal is to take a set of COUNT records and break it into N groups - each of about the same size.

You have COUNT=6
You set N = 4
with 6 things, there will be about 1 or 2 things per group.

It broke it into 4 groups for you.



comparing ntile and WIDTH_BUCKET

Matteo, September 10, 2007 - 12:08 pm UTC

Hi Tom,
I'm trying to better understand width_bucket comparing it with ntile, but I found a strange(for me) behaviour:

If I create 4 buckets from 0 to 1000 I would expect
bucket01 0..249,
bucket02 250..499,
bucket03 500..749,
bucket04 750..999

and values:
100 to bucket01
340 to bucket02
550 to bucket03
and so on.

Simulating with random numbers:

select r,
ntile(4) over(order by r) nt,
WIDTH_BUCKET(r, 0, 1000, 4) wb

from (
select trunc(dbms_random.value(0, 1000)) r
from (select level le from dual connect by level <100)
)
order by r

I get:
R NT WB
0 3 2
9 1 3
9 2 3
9 2 4
11 4 3
30 4 4
31 3 1
40 1 4
41 3 3
60 2 1
81 4 1
98 3 2
......

While I expect all these values to be on bucket01.

Can you help me understanding why?

Thanks
Matteo

Tom Kyte
September 15, 2007 - 2:17 pm UTC

side effect of calling random, in the above query, R is just dbms_random.value - and is getting re-evaluated.

do this instead:

ops$tkyte%ORA10GR2> create table t
  2  as
  3      select trunc(dbms_random.value(0, 1000)) r
  4      from (select level le from dual connect by level <100)
  5  /

Table created.

ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> select r,
  2         ntile(4) over(order by r) nt,
  3         WIDTH_BUCKET(r, 0,  1000, 4) wb
  4  from (select * from t
  5      )
  6      order by r
  7  /

         R         NT         WB
---------- ---------- ----------
         4          1          1
        10          1          1
        37          1          1
        41          1          1
        46          1          1
        50          1          1
        61          1          1
        66          1          1
        72          1          1
        75          1          1
        77          1          1
...

with less nesting...

Matteo, September 13, 2007 - 12:16 pm UTC

Hi Tom,
the behavior changes with less nesting:

select r,
ntile(4) over(order by r) nt,
WIDTH_BUCKET(r, 0, 1000, 4) wb
from (
select trunc(dbms_random.value(0, 1000)) r
from dual connect by level <100
)
order by r

In this case it seems to work as expected.

The output is different even if the 2 queries are formally the same.

Any suggestion?
Thanks
Matteo
Tom Kyte
September 15, 2007 - 7:41 pm UTC

see above.

got it

Matteo, September 17, 2007 - 3:58 am UTC

Thanks

Overflow bucket

Zahir, May 26, 2010 - 9:28 am UTC

Tom -


Not sure if it is a documentation bug or my interpretation of the documentation . Please advise.

In the following example , the bucket for the value 5 should be 5 , not in the overflow bucket 6.

Per Document , the values greater than max_value will be placed in the overflow bucket .


drop table t purge ;
Create Table T  ( col1 integer) ;

Insert Into T Values(1) ;
Insert Into T Values(2) ;
Insert Into T Values(3) ;desc
Insert Into T Values(4) ;
Insert Into T Values(5) ;



SQL> select col1 , width_bucket(col1 , 1, 5 , 5) from t;

      COL1 WIDTH_BUCKET(COL1,1,5,5)
---------- ------------------------
         1                        1
         2                        2
         3                        3
         4                        4
         5                        6




<quote>

.....

When needed, Oracle Database creates an underflow bucket numbered 0 and an overflow bucket numbered num_buckets+1.
These buckets handle values less than min_value and more than max_value and are helpful in checking the reasonableness of endpoints.


.....

Examples


Customers with credit limits greater than the maximum value are assigned to the overflow bucket, 11
</quote>

Tom Kyte
May 26, 2010 - 10:17 am UTC

Ideally each bucket is a closed-open interval of the real number line.

It is meant to be open on the high end (<=,not just <)

A reader, June 07, 2010 - 2:31 pm UTC

Sorry . I am not following ..


If each each bucket is a closed open interval, shouldn't the result set be as

COL1 WIDTH_BUCKET(COL1,1,5,5)
---------- ------------------------
1 2
2 3
3 4
4 5
5 6


Or the entire resultset is a closed open interval ...


Tom Kyte
June 09, 2010 - 8:45 am UTC

You quoted:

When needed, Oracle Database creates an underflow bucket numbered 0 and an overflow bucket numbered num_buckets+1.
These buckets handle values less than min_value and more than max_value and are helpful in checking the reasonableness of endpoints.


there was no need for a bucket 0 - since 1 fits in the first bucket.

then buckets started numbering from 1, we got buckets 1,2,3,4

then we hit 5, 5 is not less than "your upper bound" so by definition it went into bucket number "num_buckets+1" which was 6.

There is no bucket number 5.

So, 0 - not needed.
1..4 were used for the observed values
5 was not needed, there was no data there
and 6 was used for the overflow value.

More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.