You Asked
After computing statistics on an index, the "clustering_factor" is available when querying DBA_INDEXES. A book I have
states: "If the clustering factor is greater than the number of blocks in the index, then the index is out of balance due to a large volume of insert and delete operations." (High Performance Oracle8 Tuning, Burleson, p. 316, if you have it.)
First, can you give a better explanation of what the clustering_factor is? Then, which blocks do you think they mean: index leaf blocks or index physical blocks?
I have another (same subject). The book goes on to say "If the clustering factor is more than 50 percent of the number of rows
in the table that it is indexing, you should consider (rebuilding) the index." Here, are they talking about num_rows from DBA_INDEXES or the actual (counted) number of rows in the table?
One more. It says "If the value for the clustering factor approaches the number of blocks in the base table, then the index is said to be clustered. If the clustering factor is greater than the number of blocks in the base table and approaches the number of rows in the base table, then the index is unclustered." OK: so which one is good and when do we consider rebuilding?
Thanks for helping.
and Tom said...
From the Reference manual:
<quote>
CLUSTERING_FACTOR:
Indicates the amount of order of the rows in the table based
on the values of the index.
o If the value is near the number of blocks, then the table
is very well ordered. In this case, the index entries in a
single leaf block tend to point to rows in the same data
blocks.
o If the value is near the number of rows, then the table
is very randomly ordered. In this case, it is unlikely
that index entries in the same leaf block point to rows
in the same data blocks.
</quote>
I do not believe the clustering factor they talk of is the clustering_factor in the dba-indexes but rather some metric they have made up. Not having the book referred to -- I cannot confirm this.
If the clustering factor they are talking of is the clustering_factor from dba_indexes, then I do not see how they can come to the conclusion that the "index is out of balance" from that. It is a factor that looks at the index entries on blocks and looks at the rowids they point to. It is telling us how close together the data in the table these index entries point to are (how sorted the table is with respect to this index).
Let's say you have a really small block size and only 3 index entries fit on a block. On a given block -- you have the values 1, 2, and 3. The value 1 points to a block in the first extent of the TABLE. The value 2 points to a block in the second extent of the TABLE. The value 3 points to a block in the third extent of the table. The clustering factor, if all other blocks are like this block, would be near the number of rows in the table -- since none of the index entries on this block point to rows that are physically close to eachother in the table.
If they go on to say -- rebuild the index based on this value -- something is wrong in their thinking if you ask me. No matter how many times I rebuild that index -- the values 1, 2 and 3 are always going to end up in the same index block and they will still point to blocks that are very far away from eachother. The clustering factor will not be materially affected by an index rebuild (it might get marginally smaller as the index is compacted -- it might get worse as well since more index entries might fit into a block after the rebuild). The clustering_factor is a measure of how physically ordered the table is with respect to this index! Not how out of balance an index is.
Here is an example. I have a table EMP with 21,000+ rows in it. I built this table via:
create table emp
as
select object_id empno, owner||'.'||object_name ename,
object_type city, mod(object_id,100) deptno,
status division
from all_objects
/
create unique index emp1 on emp(empno);
create index emp2 on emp(ename);
create index emp3 on emp(deptno);
Now, lets look at the indexes:
ops$tkyte@DEV816> analyze index emp1 compute statistics;
ops$tkyte@DEV816> analyze index emp2 compute statistics;
ops$tkyte@DEV816> analyze index emp3 compute statistics;
ops$tkyte@DEV816> select index_name, clustering_factor, num_rows
2 from user_indexes where index_name like 'EMP_';
Index
Name CLUSTERING_FACTOR NUM_ROWS
------------------------------ ----------------- ----------
EMP1 19632 21442
EMP2 186 21442
EMP3 12000 21442
ops$tkyte@DEV816> analyze table emp compute statistics;
ops$tkyte@DEV816> select blocks from user_tables where
table_name = 'EMP';
BLOCKS
----------
171
No matter what index I rebuild, or how many ways I rebuild it, the results of the above will be the same. What I can "glean" from the information above is that:
o the data in the TABLE emp appears to be sorted physically by the columns in the index EMP2 since the clustering factor of EMP2 is very near the number of blocks. EMP2 is the index on OWNER||'.'||OBJECT_NAME from ALL_OBJECTS. This makes sense if you run the all_objects view, data tends to come out sorted by owner, object_name due to the way the view is processed.
o the data in the TABLE emp is very randomly organized with respect to EMP1 -- the index on empno. The object_ids are coming out of that table very randomly. So, the TABLE data is not sorted by empno (object_id).
o the data in the TABLE emp is sort of organized physically in order by DEPTNO. Its sort of in the middle.
ops$tkyte@DEV816> alter index emp2 rebuild;
ops$tkyte@DEV816> analyze index emp2 compute statistics;
ops$tkyte@DEV816> select index_name, clustering_factor, num_rows from
2 user_indexes where index_name like 'EMP_';
Index
Name CLUSTERING_FACTOR NUM_ROWS
------------------------------ ----------------- ----------
EMP1 19632 21442
EMP2 186 21442
EMP3 12000 21442
ops$tkyte@DEV816>
The numbers will always come out the same.
The clustering_factor can only be used to guess if using the index to read the table lots will cause the table to be read almost in order or not. For example:
select * from emp order by ename;
if I use the index emp2 to process that -- the data in EMP pointed to index entries in the first index leaf block I process will mostly be on the same block(s). I will do a physical read of the table and then many logical IO's of that block before moving onto the next block. It will be fairly efficient to range scan large ranges with TABLE ACCESS BY ROWIDS using emp2.
On the other hand, if I have:
select * from emp order by object_id;
The index EMP1 is not very good for reading that table. The first entry in the index will point to block 1, the second to block 100, the third to block 500 and so on -- to process the N rows on the first index leaf block, I'll have to do N physical IO's to get the blocks from disk probably. Not only that but they will probably get flushed from the cache since I won't need them again right away and I'll have to re-read them once again later. It will be horribly inefficient to range scan LARGE ranges with TABLE ACCESS BY ROWIDS using EMP1.
I myself do not really believe in the rule of "rebuild your indexes". I like the alter index rebuild command to MOVE indexes, but I do not in practice rebulid indexes.
Rating
(8 ratings)
Is this answer out of date? If it is, please let us know via a Comment