Skip to Main Content
  • Questions
  • REBUILDING INDEXES: clustering_factor

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, James.

Asked: September 01, 2000 - 4:58 pm UTC

Last updated: February 18, 2004 - 9:09 pm UTC

Version: 8.0.3.0.0

Viewed 1000+ times

You Asked

After computing statistics on an index, the "clustering_factor" is available when querying DBA_INDEXES. A book I have
states: "If the clustering factor is greater than the number of blocks in the index, then the index is out of balance due to a large volume of insert and delete operations." (High Performance Oracle8 Tuning, Burleson, p. 316, if you have it.)

First, can you give a better explanation of what the clustering_factor is? Then, which blocks do you think they mean: index leaf blocks or index physical blocks?

I have another (same subject). The book goes on to say "If the clustering factor is more than 50 percent of the number of rows
in the table that it is indexing, you should consider (rebuilding) the index." Here, are they talking about num_rows from DBA_INDEXES or the actual (counted) number of rows in the table?

One more. It says "If the value for the clustering factor approaches the number of blocks in the base table, then the index is said to be clustered. If the clustering factor is greater than the number of blocks in the base table and approaches the number of rows in the base table, then the index is unclustered." OK: so which one is good and when do we consider rebuilding?

Thanks for helping.

and Tom said...

From the Reference manual:

<quote>
CLUSTERING_FACTOR:

Indicates the amount of order of the rows in the table based
on the values of the index.

o If the value is near the number of blocks, then the table
is very well ordered. In this case, the index entries in a
single leaf block tend to point to rows in the same data
blocks.

o If the value is near the number of rows, then the table
is very randomly ordered. In this case, it is unlikely
that index entries in the same leaf block point to rows
in the same data blocks.
</quote>

I do not believe the clustering factor they talk of is the clustering_factor in the dba-indexes but rather some metric they have made up. Not having the book referred to -- I cannot confirm this.

If the clustering factor they are talking of is the clustering_factor from dba_indexes, then I do not see how they can come to the conclusion that the "index is out of balance" from that. It is a factor that looks at the index entries on blocks and looks at the rowids they point to. It is telling us how close together the data in the table these index entries point to are (how sorted the table is with respect to this index).

Let's say you have a really small block size and only 3 index entries fit on a block. On a given block -- you have the values 1, 2, and 3. The value 1 points to a block in the first extent of the TABLE. The value 2 points to a block in the second extent of the TABLE. The value 3 points to a block in the third extent of the table. The clustering factor, if all other blocks are like this block, would be near the number of rows in the table -- since none of the index entries on this block point to rows that are physically close to eachother in the table.

If they go on to say -- rebuild the index based on this value -- something is wrong in their thinking if you ask me. No matter how many times I rebuild that index -- the values 1, 2 and 3 are always going to end up in the same index block and they will still point to blocks that are very far away from eachother. The clustering factor will not be materially affected by an index rebuild (it might get marginally smaller as the index is compacted -- it might get worse as well since more index entries might fit into a block after the rebuild). The clustering_factor is a measure of how physically ordered the table is with respect to this index! Not how out of balance an index is.

Here is an example. I have a table EMP with 21,000+ rows in it. I built this table via:


create table emp
as
select object_id empno, owner||'.'||object_name ename,
object_type city, mod(object_id,100) deptno,
status division
from all_objects
/

create unique index emp1 on emp(empno);
create index emp2 on emp(ename);
create index emp3 on emp(deptno);


Now, lets look at the indexes:

ops$tkyte@DEV816> analyze index emp1 compute statistics;
ops$tkyte@DEV816> analyze index emp2 compute statistics;
ops$tkyte@DEV816> analyze index emp3 compute statistics;

ops$tkyte@DEV816> select index_name, clustering_factor, num_rows
2 from user_indexes where index_name like 'EMP_';

Index
Name CLUSTERING_FACTOR NUM_ROWS
------------------------------ ----------------- ----------
EMP1 19632 21442
EMP2 186 21442
EMP3 12000 21442


ops$tkyte@DEV816> analyze table emp compute statistics;
ops$tkyte@DEV816> select blocks from user_tables where
table_name = 'EMP';

BLOCKS
----------
171


No matter what index I rebuild, or how many ways I rebuild it, the results of the above will be the same. What I can "glean" from the information above is that:

o the data in the TABLE emp appears to be sorted physically by the columns in the index EMP2 since the clustering factor of EMP2 is very near the number of blocks. EMP2 is the index on OWNER||'.'||OBJECT_NAME from ALL_OBJECTS. This makes sense if you run the all_objects view, data tends to come out sorted by owner, object_name due to the way the view is processed.

o the data in the TABLE emp is very randomly organized with respect to EMP1 -- the index on empno. The object_ids are coming out of that table very randomly. So, the TABLE data is not sorted by empno (object_id).

o the data in the TABLE emp is sort of organized physically in order by DEPTNO. Its sort of in the middle.


ops$tkyte@DEV816> alter index emp2 rebuild;
ops$tkyte@DEV816> analyze index emp2 compute statistics;
ops$tkyte@DEV816> select index_name, clustering_factor, num_rows from
2 user_indexes where index_name like 'EMP_';

Index
Name CLUSTERING_FACTOR NUM_ROWS
------------------------------ ----------------- ----------
EMP1 19632 21442
EMP2 186 21442
EMP3 12000 21442
ops$tkyte@DEV816>

The numbers will always come out the same.

The clustering_factor can only be used to guess if using the index to read the table lots will cause the table to be read almost in order or not. For example:

select * from emp order by ename;

if I use the index emp2 to process that -- the data in EMP pointed to index entries in the first index leaf block I process will mostly be on the same block(s). I will do a physical read of the table and then many logical IO's of that block before moving onto the next block. It will be fairly efficient to range scan large ranges with TABLE ACCESS BY ROWIDS using emp2.

On the other hand, if I have:

select * from emp order by object_id;

The index EMP1 is not very good for reading that table. The first entry in the index will point to block 1, the second to block 100, the third to block 500 and so on -- to process the N rows on the first index leaf block, I'll have to do N physical IO's to get the blocks from disk probably. Not only that but they will probably get flushed from the cache since I won't need them again right away and I'll have to re-read them once again later. It will be horribly inefficient to range scan LARGE ranges with TABLE ACCESS BY ROWIDS using EMP1.


I myself do not really believe in the rule of "rebuild your indexes". I like the alter index rebuild command to MOVE indexes, but I do not in practice rebulid indexes.

Rating

  (8 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Is fair to say the smaller of clustering factor is the better index??

Mike, August 31, 2001 - 10:02 pm UTC


Tom Kyte
September 01, 2001 - 9:21 am UTC

Well, assuming that you have more rows than blocks (eg: rowsize < blocksize) -- MAYBE. If you have less rows then blocks (rowsize > blocksize) -- not so.

This factor relates to how efficient this index is for "large range scans". If the number of rows to be returned via an index range is small -- the cluster factor does not come into play. If the number of rows to be returned via an index range scan is deemed LARGE -- the cluster factor will come into play. The optimizer might start choosing a full scan over an index range scan for an index that does not have a good cluster factor (due to the fact that the resulting random IO to many single blocks here and there would be slower then just reading the table in the first place).

So, the cluster factor is used to determine whether an index is a good candidate for LARGE range scans. For normal index usage (get me my small set of rows) it doesn't come into play as much.

Which view can show the rowsize?

Mike, September 01, 2001 - 10:12 am UTC


rowsize is AVG_ROW_LEN in DBA_TABLES?

Mike, September 01, 2001 - 2:35 pm UTC


Tom Kyte
September 01, 2001 - 4:53 pm UTC

Thats the average row length in bytes, yes. Only valid after an ANALYZE.

Very interesting

Doug Cowles, October 20, 2001 - 3:32 pm UTC

I've alwasy wondered how indexes can be rebuilt and get larger clustering factors.

REBUILDING INDEXES: clustering_factor

Christophe Tsobgny, February 06, 2002 - 6:23 am UTC

Hello

As it is almost impossible to ask tom a question being
in Europe and having a different time zone,
i did a search on indexes and improving query
response time.
The article was very useful and helped confirm
my understanding of the clustering factor.
But will it really improve query response time
if one has a good clustering factor ?
i have one query that has a good execution plan
but the response time is always bad.

kind regards

Tom Kyte
February 06, 2002 - 8:12 am UTC

Well, its not a factor of being in Europe or India or anywhere really (i've run the stats -- I've a pretty even distribution of questions from everywhere). Its more related to the fact that about 4,000 distinct people come here everyday and on a good day I'll take as many as 40 questions....

If you are interested in examples showing the differences with the clustering factor -- i have some in my book. However -- i would start more simply with tkprof and sql_trace and see what the query is doing, see what the waits (if any are). Chapter 10 of my book would take you through that in detail,

</code> http://asktom.oracle.com/~tkyte/tkprof.html <code>

would get you started as well

Mark of a genius

Pawan, February 06, 2002 - 8:31 am UTC

It was very confusing for me to understand the clustering factor but Tom made it so simple.

Thanks

Better for data warehousing than OLTP

Dave, February 06, 2002 - 1:21 pm UTC

Here's my two-cents.

I always think of reducing the clustering factor by physical row ordering as a quick-and-dirty partitioning feature, or at least a complement to partitioning. Here's a practical example ...

Say you are loading up a large fact table in a data warehouse with 10million rows per daily load. It would be common practice to use rolling partitions to provide one partition per day, with local indexing. If users commonly access the data according to a particular column -- let's say it is by zip code -- then physically ordering your daily data load by zip code gives a low clustering factor on the zip_code index and, as Tom says, encourages large index range scans. It would probably be impractical to partition by day and then by zip code, but you get some of the same perfomance enhancement as you might do through partition elimination. A major benefit is that you can go further and choose, for example, four attributes that users commonly access the data by, and physically order the rows by each (ie. insert into ... select ... order by attribute_1, attribute_2, attribute_3, attribute_4;).

I've reduced the number of blocks read by common queries by a factor of 25 by doing this, and reduced device I/O measured at the O/S level by a factor of 8. Not to be sniffed at.

Of course you get into a case of diminishing returns, but it's more true in DW than in OLTP that you insert once, select many times, and an extra bit of work in data load can go a long way towards getting response time down.



this looks the other way round

A reader, February 18, 2004 - 9:51 am UTC

Hi I have a query in two servers, one fast and one slow. The funny thing is both have same configuration (same init.ora settings). In fact the slow database is a copy of the fast one made last week! The only difference I see is clustering factor in the indexes.

SLOW DATABASE:

OWNER TABLE_NAME INDEX_NAME CLUSTERING_FACTOR NUM_ROWS
------------ ----------------- ------------------ ----------------- ----------
SYSADM PSWORKLIST PS0PSWORKLIST 2358 75593
SYSADM PSWORKLIST PSAPSWORKLIST 2218 75593
SYSADM PSWORKLIST PSBPSWORKLIST 17453 75593
SYSADM PSWORKLIST PS_PSWORKLIST 2218 75600


FAST DATABASE
OWNER TABLE_NAME INDEX_NAME CLUSTERING_FACTOR NUM_ROWS
------------ ------------------------------ ------------------------------ ----------------- ----------
SYSADM PSWORKLIST PS0PSWORKLIST 2358 75593
SYSADM PSWORKLIST PSAPSWORKLIST 2218 75593
SYSADM PSWORKLIST PSBPSWORKLIST 17453 75593
SYSADM PSWORKLIST PS_PSWORKLIST 2218 75600



The thing is PSBPSWORKLIST in the slow database shows a low clustering factor so this index should be used and otherway round in the fast database. But in reality Oracle full scans PSWORKLIST in the slow database and index range scan in the fast database!!! Why so?!

Tom Kyte
February 18, 2004 - 9:09 pm UTC

umm, query - autotrace explain only output and tkprofs? (and settings of optimizer_*)


but it does sound "normal", cluster factor near number of blocks = range scan 'good looking'. cluster factor near number of rows = range scan 'not appealing', bulky operations (full scans, fast full scans, etc) more appealing.