WIDTH_BUCKET with non numeric data

Thanks for the question, Bernard.

Asked: December 30, 2004 - 11:04 am UTC

Last updated: June 09, 2010 - 8:45 am UTC

Version: 9.2.0.5

Viewed 1000+ times

You Asked

Hello,

I am currently investigating histograms. As such, I use the function called WIDTH_BUCKET for dealing with skewed data.

However, It is only for numeric data.

How to deal with non-muneric data, such as DATE or CHAR/VARCHAR datatypes ?

Thx

and Tom said...

Say you have a script like this:

-------------------------- histo.sql -----------------------------------
column hist format a40

define TNAME=&1
define CNAME=&2
define BUCKETS=&3

select wb,
cnt,
to_char(round( 100*cnt/(max(cnt) over ()),2),'999.00') rat,
rpad( '*', 40*cnt/(max(cnt) over ()), '*' ) hist
from (
select wb,
count(*) cnt
from (
select width_bucket( r, 0, (select count(distinct &cname) from &tname)+1, &buckets) wb
from (
select dense_rank() over (order by &cname) r
from &tname
)
)
group by wb
)
order by wb
/
-------------------------------------------------------------------------

You can use it to visualize the relative spread of data.

for example:

ops$tkyte@ORA9IR2> @histo all_objects owner 10
ops$tkyte@ORA9IR2> set verify off
ops$tkyte@ORA9IR2> set echo off

WB CNT RAT HIST
---------- ---------- ------- ----------------------------------------
1 1 .01
2 265 1.89
3 929 6.62 **
4 238 1.70
5 998 7.11 **
6 93 .66
7 11753 83.75 *********************************
8 14033 100.00 ****************************************
9 131 .93
10 270 1.92

10 rows selected.

shows there is clumping...

ops$tkyte@ORA9IR2> @histo all_objects owner 20
ops$tkyte@ORA9IR2> set verify off
ops$tkyte@ORA9IR2> set echo off

WB CNT RAT HIST
---------- ---------- ------- ----------------------------------------
2 1 .01
3 4 .03
4 261 1.91
5 27 .20
6 902 6.61 **
7 234 1.71
8 4 .03
9 29 .21
10 969 7.10 **
11 7 .05
12 86 .63
13 11738 85.98 **********************************
14 15 .11
15 13652 100.00 ****************************************
16 381 2.79 *
17 2 .01
18 129 .94
19 270 1.98

18 rows selected.

since 18 < 20, we'd have a value per bucket and see it might be worth it -- since most of the data is 2 points

ops$tkyte@ORA9IR2> @histo all_objects created 10
ops$tkyte@ORA9IR2> set verify off
ops$tkyte@ORA9IR2> set echo off

WB CNT RAT HIST
---------- ---------- ------- ----------------------------------------
1 2804 25.49 **********
2 1347 12.25 ****
3 952 8.65 ***
4 957 8.70 ***
5 11000 100.00 ****************************************
6 6473 58.85 ***********************
7 1343 12.21 ****
8 2074 18.85 *******
9 649 5.90 **
10 1112 10.11 ****

10 rows selected.

created has some clumping as well... but:

ops$tkyte@ORA9IR2> @histo all_objects object_id 15
ops$tkyte@ORA9IR2> set verify off
ops$tkyte@ORA9IR2> set echo off

WB CNT RAT HIST
---------- ---------- ------- ----------------------------------------
1 1914 99.95 ***************************************
2 1914 99.95 ***************************************
3 1914 99.95 ***************************************
4 1914 99.95 ***************************************
5 1914 99.95 ***************************************
6 1914 99.95 ***************************************
7 1914 99.95 ***************************************
8 1915 100.00 ****************************************
9 1914 99.95 ***************************************
10 1914 99.95 ***************************************
11 1914 99.95 ***************************************
12 1914 99.95 ***************************************
13 1914 99.95 ***************************************
14 1914 99.95 ***************************************
15 1914 99.95 ***************************************

15 rows selected.

ops$tkyte@ORA9IR2>

object id is nicely distributed (should be -- primary key!)

Rating

(10 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Bernard H, December 30, 2004 - 2:45 pm UTC

Thx for this great insight.

Congratulations for this tremendous Web Site.

Moreover, it seems to be easy to reach experts (very available). I am really impressed as I was while reading your book "Effective Oracle by Design".

Keep up with this excellent work and Happy New Year 2005.

Anto, February 11, 2005 - 10:05 am UTC

Hi Tom,

Thanks for such a useful script

ops$tkyte@ORA9IR2> @histo all_objects owner 20
ops$tkyte@ORA9IR2> set verify off
ops$tkyte@ORA9IR2> set echo off

WB CNT RAT HIST
---------- ---------- ------- ----------------------------------------
2 1 .01
3 4 .03
4 261 1.91
5 27 .20
6 902 6.61 **
7 234 1.71
8 4 .03
9 29 .21
10 969 7.10 **
11 7 .05
12 86 .63
13 11738 85.98 **********************************
14 15 .11
15 13652 100.00 ****************************************
16 381 2.79 *
17 2 .01
18 129 .94
19 270 1.98

18 rows selected.

since 18 < 20, we'd have a value per bucket and see it might be worth it --
since most of the data is 2 points

a) Can you please explain the above again please ?

b) Can we estimate the optimum bucket size for histograms using this script ?

c) Can we modify this script to get the column value as well ?

thanks
Anto

February 12, 2005 - 7:44 am UTC

Ok, starting with a) Run the queries from the inside out: ops$tkyte@ORA9IR2> create table t as select * from all_objects; Table created. ops$tkyte@ORA9IR2> define TNAME=&1 ops$tkyte@ORA9IR2> define CNAME=&2 ops$tkyte@ORA9IR2> define BUCKETS=&3 ops$tkyte@ORA9IR2> ops$tkyte@ORA9IR2> select * from ( 2 select &cname, dense_rank() over (order by &cname) r 3 from &tname 4 ) where rownum <= 10 5 / OWNER R ------------------------------ ---------- B 1 BIG_TABLE 2 BIG_TABLE 2 BIG_TABLE 2 CTXSYS 3 CTXSYS 3 CTXSYS 3 CTXSYS 3 CTXSYS 3 CTXSYS 3 10 rows selected. assign a ranking (dense, 1, 2, 3... ) to the strings. width_buckets for creating histograms -- they need a number. mapping of strings to numbers in sorted order ops$tkyte@ORA9IR2> ops$tkyte@ORA9IR2> ops$tkyte@ORA9IR2> select * from ( 2 select &cname, 3 width_bucket( r, 0, (select count(distinct &cname) from &tname)+1, &buckets) wb 4 from ( 5 select &cname, dense_rank() over (order by &cname) r 6 from &tname 7 ) 8 ) where rownum <= 10 9 / OWNER WB ------------------------------ --- B 1 BIG_TABLE 1 BIG_TABLE 1 BIG_TABLE 1 CTXSYS 2 CTXSYS 2 CTXSYS 2 CTXSYS 2 CTXSYS 2 CTXSYS 2 10 rows selected. Now, put them into "buckets". we've told Oracle "here is a range of numbers - r, now, please put r into a histogram of &BUCKETS buckets, low range of the histogram is 0 and the high non-inclusive bound is count(distinct &cname)+1 -- since the upper bound is non-inclusive, we added 1" ops$tkyte@ORA9IR2> ops$tkyte@ORA9IR2> select wb, 2 count(*) cnt, 3 min(&cname) lo, 4 max(&cname) hi 5 from ( 6 select &cname, width_bucket( r, 0, (select count(distinct &cname) from &tname)+1, &buckets) wb 7 from ( 8 select &cname, dense_rank() over (order by &cname) r 9 from &tname 10 ) 11 ) 12 group by wb 13 / WB CNT LO HI --- ------ ------------------------------ ------------------------------ 1 4 B BIG_TABLE 2 263 CTXSYS DW_XPS 3 333 MDSYS OPS$TKYTE 4 998 ORDPLUGINS ORDSYS 5 93 OUTLN PERFSTAT 6 11743 PUBLIC SCOTT 7 14037 SYS SYSTEM 8 2 U1 U2 9 131 USER1 WMSYS 10 274 WSMGMT XDB 10 rows selected. once we have them in buckets -- grouping them and counting how many fall into the bucket it easy. we can even get the HI and LO values in the buck as demostrated, the remaing level in the query is just for formatting so, onto number 2 here: ops$tkyte@ORA9IR2> ops$tkyte@ORA9IR2> set echo off @wb t object_id 10 old 9: min(&cname) lo, new 9: min(object_id) lo, old 10: max(&cname) hi new 10: max(object_id) hi old 12: select &cname, width_bucket( r, 0, (select count(distinct &cname) from &tname)+1, &buckets) wb new 12: select object_id, width_bucket( r, 0, (select count(distinct object_id) from t)+1, 10) wb old 14: select &cname, dense_rank() over (order by &cname) r new 14: select object_id, dense_rank() over (order by object_id) r old 15: from &tname new 15: from t WB DATA CNT RAT HIST --- ------------------ ------ ------- ---------------------------------------- 1 3/2849 2787 99.96 *************************************** 2 2850/5722 2788 100.00 **************************************** 3 5723/8540 2788 100.00 **************************************** 4 8541/11328 2788 100.00 **************************************** 5 11329/14116 2788 100.00 **************************************** 6 14117/16904 2788 100.00 **************************************** 7 16905/19692 2788 100.00 **************************************** 8 19693/22480 2788 100.00 **************************************** 9 22481/25285 2788 100.00 **************************************** 10 25286/38285 2787 99.96 *************************************** 10 rows selected. @wb t owner 10 old 9: min(&cname) lo, new 9: min(owner) lo, old 10: max(&cname) hi new 10: max(owner) hi old 12: select &cname, width_bucket( r, 0, (select count(distinct &cname) from &tname)+1, &buckets) wb new 12: select owner, width_bucket( r, 0, (select count(distinct owner) from t)+1, 10) wb old 14: select &cname, dense_rank() over (order by &cname) r new 14: select owner, dense_rank() over (order by owner) r old 15: from &tname new 15: from t WB DATA CNT RAT HIST --- ------------------ ------ ------- ---------------------------------------- 1 B/BIG_TABLE 4 .03 2 CTXSYS/DW_XPS 263 1.87 3 MDSYS/OPS$TKYTE 333 2.37 4 ORDPLUGINS/ORDSYS 998 7.11 ** 5 OUTLN/PERFSTAT 93 .66 6 PUBLIC/SCOTT 11743 83.66 ********************************* 7 SYS/SYSTEM 14037 100.00 **************************************** 8 U1/U2 2 .01 9 USER1/WMSYS 131 .93 10 WSMGMT/XDB 274 1.95 10 rows selected. you are not using this to determine the optimal nubmer, but rather to visualize of the data is skewed... object id, appears to not be (use a &3 that fills your screen, lets you see any patterns). owner appears to definitely be. and in fact, if you used 254 for the buckets in my case, it would completely fill out since I only have 20 distinct schemas with objects. You would see that owner is a candidate for histograms by the pattern. The number of buckets -- 254 seems good. If there are less distinct values in the table, you get less. If there are more, you get as many as you can. It would be rare to have to play with that number c) done.

Anto, February 14, 2005 - 9:59 am UTC

Hi Tom,

Thanks for the detailed explanation.
I understand the scripts are useful to visualize the skewness of data

I still am not clear how to arrive at the bucket size. Is there any method for that or we have to experiment to get the optimal value?

Here is what I found from Oracle metalink

"The default number of buckets is 75. This value provides an appropriate level of detail for most data distributions. However, since the number of buckets in the histogram, the sampling rate, and the data distribution all affect the usefulness of a histogram, you may need to experiment with different numbers of buckets to obtain the best results.

If the number of frequently occurring distinct values in a column is relatively small, then it is useful to set the number of buckets to be greater than the number of frequently occurring distinct values. "

As per the above statement, the bucket size in this case for Owner column should be around 5 or 10.(based on number of frequently occurring distinct values for the column), how come you say " the bucket size of 254 seems to be good" ?

thanks
Anto

February 14, 2005 - 1:26 pm UTC

254, use the biggest one there -- if you have less than 254 discrete values, it'll scale down to a bucket/value. If you have more, you'll get the most you can.

Difference between NTILE and WIDTH_BUCKET

Zahir M, January 22, 2006 - 9:19 pm UTC

Tom ,
Could you please elaborate the difference between NTILE and WIDTH_BUCKET?

Thanks

January 23, 2006 - 9:39 am UTC

http://docs.oracle.com/docs/cd/B19306_01/server.102/b14200/functions214.htm#i1001630
<quote src=that url> Purpose WIDTH_BUCKET lets you construct equiwidth histograms, in which the histogram range is divided into intervals that have identical size. (Compare this function with NTILE, which creates equiheight histograms.) Ideally each bucket is a closed-open interval of the real number line. For example, a bucket can be assigned to scores between 10.00 and 19.999... to indicate that 10 is included in the interval and 20 is excluded. This is sometimes denoted [10, 20). </quote> width bucket -> user supplied range is used... ntile -> number of observations is the same in each group, regardless of range ops$tkyte@ORA9IR2> select r, 2 width_bucket( r, 1, 11, 3 ) wb, 3 width_bucket( r, 1, 100, 3 ) wb, 4 ntile(3) over (order by r) nb 5 from t; R WB WB NB ---------- ---------- ---------- ---------- 1 1 1 1 2 1 1 1 3 1 1 1 4 1 1 1 5 2 1 2 6 2 1 2 7 2 1 2 8 3 1 3 9 3 1 3 10 3 1 3 10 rows selected. See how width bucket very much depends on the range - it takes the user supplied range - splits it into N groups (when range is 1 <= x <100, the ranges would be 0..34, 35..67 and 68..99 - with 100 being an outlier in group 4 for example)

Ntile

Zahir M, February 03, 2006 - 9:15 pm UTC

Tom , Few questions on NTILE 1) Should NTILE returning 4 since the sal is the same ( 99 ) across all the employees in dept # 30 . 1 Select deptno , ename , sal , hiredate , 2 ntile(4) over( order by sal ) nt , 3 width_bucket( sal , 1 , 100 , 4) wb 4 from emptmp 5* where deptno = 30 SQL> / DEPTNO ENAME SAL HIREDATE NT WB ---------- ---------- ---------- --------- ---------- ---------- 30 ALLEN 99 20-FEB-81 1 4 WARD 99 22-FEB-81 1 4 JAMES 99 03-DEC-81 2 4 BLAKE 99 01-MAY-81 2 4 TURNER 99 08-SEP-81 3 4 MARTIN 99 28-SEP-81 4 4 6 rows selected. SQL> select version from v$instance; VERSION ----------------- 10.2.0.1.0 2) What is the meaning of equi-width and equi-height histogram ( as mentioned in the manual ) ?

February 06, 2006 - 12:08 am UTC

no, you've missed the intention of ntile. Ntile is a lot like (row_number/count/N)

its goal is to take a set of COUNT records and break it into N groups - each of about the same size.

You have COUNT=6
You set N = 4
with 6 things, there will be about 1 or 2 things per group.

It broke it into 4 groups for you.

comparing ntile and WIDTH_BUCKET

Matteo, September 10, 2007 - 12:08 pm UTC

Hi Tom,
I'm trying to better understand width_bucket comparing it with ntile, but I found a strange(for me) behaviour:

If I create 4 buckets from 0 to 1000 I would expect
bucket01 0..249,
bucket02 250..499,
bucket03 500..749,
bucket04 750..999

and values:
100 to bucket01
340 to bucket02
550 to bucket03
and so on.

Simulating with random numbers:

select r,
ntile(4) over(order by r) nt,
WIDTH_BUCKET(r, 0, 1000, 4) wb

from (
select trunc(dbms_random.value(0, 1000)) r
from (select level le from dual connect by level <100)
)
order by r

I get:
R NT WB
0 3 2
9 1 3
9 2 3
9 2 4
11 4 3
30 4 4
31 3 1
40 1 4
41 3 3
60 2 1
81 4 1
98 3 2
......

While I expect all these values to be on bucket01.

Can you help me understanding why?

Thanks
Matteo

September 15, 2007 - 2:17 pm UTC

side effect of calling random, in the above query, R is just dbms_random.value - and is getting re-evaluated.

do this instead:

ops$tkyte%ORA10GR2> create table t 2 as 3 select trunc(dbms_random.value(0, 1000)) r 4 from (select level le from dual connect by level <100) 5 / Table created. ops$tkyte%ORA10GR2> ops$tkyte%ORA10GR2> select r, 2 ntile(4) over(order by r) nt, 3 WIDTH_BUCKET(r, 0, 1000, 4) wb 4 from (select * from t 5 ) 6 order by r 7 / R NT WB ---------- ---------- ---------- 4 1 1 10 1 1 37 1 1 41 1 1 46 1 1 50 1 1 61 1 1 66 1 1 72 1 1 75 1 1 77 1 1 ...

with less nesting...

Matteo, September 13, 2007 - 12:16 pm UTC

Hi Tom,
the behavior changes with less nesting:

select r,
ntile(4) over(order by r) nt,
WIDTH_BUCKET(r, 0, 1000, 4) wb
from (
select trunc(dbms_random.value(0, 1000)) r
from dual connect by level <100
)
order by r

In this case it seems to work as expected.

The output is different even if the 2 queries are formally the same.

Any suggestion?
Thanks
Matteo

September 15, 2007 - 7:41 pm UTC

see above.

got it

Matteo, September 17, 2007 - 3:58 am UTC

Thanks

Overflow bucket

Zahir, May 26, 2010 - 9:28 am UTC

Tom -

Not sure if it is a documentation bug or my interpretation of the documentation . Please advise.

In the following example , the bucket for the value 5 should be 5 , not in the overflow bucket 6.

Per Document , the values greater than max_value will be placed in the overflow bucket .

drop table t purge ; Create Table T ( col1 integer) ; Insert Into T Values(1) ; Insert Into T Values(2) ; Insert Into T Values(3) ;desc Insert Into T Values(4) ; Insert Into T Values(5) ; SQL> select col1 , width_bucket(col1 , 1, 5 , 5) from t; COL1 WIDTH_BUCKET(COL1,1,5,5) ---------- ------------------------ 1 1 2 2 3 3 4 4 5 6

<quote>

.....

When needed, Oracle Database creates an underflow bucket numbered 0 and an overflow bucket numbered num_buckets+1.
These buckets handle values less than min_value and more than max_value and are helpful in checking the reasonableness of endpoints.

.....

Examples

Customers with credit limits greater than the maximum value are assigned to the overflow bucket, 11
</quote>

May 26, 2010 - 10:17 am UTC

Ideally each bucket is a closed-open interval of the real number line.

It is meant to be open on the high end (<=,not just <)

A reader, June 07, 2010 - 2:31 pm UTC

Sorry . I am not following ..

If each each bucket is a closed open interval, shouldn't the result set be as

COL1 WIDTH_BUCKET(COL1,1,5,5)
---------- ------------------------
1 2
2 3
3 4
4 5
5 6

Or the entire resultset is a closed open interval ...

June 09, 2010 - 8:45 am UTC

You quoted:

When needed, Oracle Database creates an underflow bucket numbered 0 and an overflow bucket numbered num_buckets+1.
These buckets handle values less than min_value and more than max_value and are helpful in checking the reasonableness of endpoints.

there was no need for a bucket 0 - since 1 fits in the first bucket.

then buckets started numbering from 1, we got buckets 1,2,3,4

then we hit 5, 5 is not less than "your upper bound" so by definition it went into bucket number "num_buckets+1" which was 6.

There is no bucket number 5.

So, 0 - not needed.
1..4 were used for the observed values
5 was not needed, there was no data there
and 6 was used for the overflow value.

Connor and Chris don't just spend all day on AskTOM. You can also catch regular content via Connor's blog and Chris's blog. Or if video is more your thing, check out Connor's channel and Chris's channel from their Youtube channels. And of course, keep up to date with AskTOM via the official twitter account.

More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.

Breadcrumb

Question and Answer

You Asked

and Tom said...

Rating

Comments

Difference between NTILE and WIDTH_BUCKET

Ntile

comparing ntile and WIDTH_BUCKET

with less nesting...

got it

Overflow bucket

More to Explore

Analytics