Skip to Main Content

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, Susan.

Asked: June 27, 2002 - 12:27 pm UTC

Last updated: March 16, 2009 - 9:12 am UTC

Version: 8.1.7

Viewed 1000+ times

You Asked

Hi Tom,

We have a table that has over 3 million rows with no indexes (not my fault I'm new). There is a candidate for a primary key, but, this key is just a generic id key and won't be queried. There is another column which is not unique but is queried against often so this is the column I'm choosing for indexing. Two questions how should I size the initial extent for a big index like this and do I need a primary key index if the primary key is not queried against? Thanks.

and Tom said...

A primary key would be there to enforce business rules -- eg, that the field must be unique.

There is no "law" saying you must have a primary key -- but it is convention. If you truly do not have a primary key -- don't create one.


The only way I know to truly size something is to create a small example of it (somewhere between .01 and 10% of the data depending on size) and analyze it. See what it takes. An index can be tricky due to the fact that inserts/updates/deletes will affect it differently depending on how it is used.

An index on 3million rows where the column is a number will probably take about 75-100meg. An extent size between 1 and 10meg would be appropriate.

I use only locally managed tablespaces and prefer UNIFORM extents over system assigned ones. In my world the initial extent is the same size as the next extent always.

Rating

  (10 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

MJ, September 14, 2004 - 2:40 pm UTC

Tom, could you explain more on this issue. I have a huge table(millions rows)with lots of DML operations. What is the best practice to use index.

Thanks for your help.

MJ

---------------------------------------
An index can be tricky due to the fact that
inserts/updates/deletes will affect it differently depending on how it is used.
---------------------------------------

Tom Kyte
September 14, 2004 - 3:05 pm UTC

it is quite simple with regards to indexes, follow these two rules:

o have as few indexes as possible
o but as many as you need



indexing strategy question.

RMM, February 01, 2005 - 1:40 pm UTC

Hi Tom
I have a case as follows:
. no of tables is small.
. the tables are of "generic" design that you hate.
However, we use the design that has additional
columns in the table (and creating views on top
of them) - so at least that portion is correct:)

For indexing strategy I was thinking of doing
things by brute force:
1. populate the tables with representative data
set
2. Run tkprof on the PL/SQL procedures that I am
tuning and save them.
3. Create the indexes that I think will be helpful
and do step 2 again.
4. compare 2 and 3.

We are not terribly concerned about inserts (very
infrequent.) And there are no updates - only the
same record being replaced with a newer version in
our tables. In terms of comparision, I think
the only thing to look for really is the number
of LIOs. Do you know of any there any other indicators
when comparing two different index schemes?


Do you think above is a good strategy to follow?
I like it since it removes guess work. It may
not be feasible to do this for a large number
of tables though.



Tom Kyte
February 01, 2005 - 3:39 pm UTC

or you could use 10g and let the access advisor do the work for you....

but sure.

thanx - just one followup

RMM, February 01, 2005 - 3:43 pm UTC

"In terms of comparision, I think
the only thing to look for really is the number
of LIOs."
That is true mostly but, from your book
(expert one on one), if you use index compression
alternative you have to keep an eye on the CPU
usage as well...

The leading key.........

A reader, December 23, 2005 - 3:15 am UTC

Hi Tom,
while creating a composite b-tree index, be it unique/primary or a normal index, what should be the leading key order.
Should the low cardinality columns go in first or shpuld it be vice versa.
Say for e.g.
col A-> state (1,2,3)
col B-> city (A-Z)

Then should the index be (state,city) or (ity, state).
Irrespective of the queries. Assume that the queries will be equal in number for both of them.

Also, i was looking at the statistics for the primary key and another normal index.
Pk columns->(a,b)
normal columns->(b,a)
The clustering factor for the PK was close to its rows, whereas for the normal index was very low.
Both the columns have equal data and no nulls.
please explain


Tom Kyte
December 23, 2005 - 12:32 pm UTC

the cardinality has nothing to do with it.

how you USE the index, what QUESTIONS you ask - that is what it is all about.

Think of a copy of ALL_OBJECTS. Suppose these are the questions you ask:

"What objects does SCOTT own"
"What tables does SCOTT own"
"Show me all of the details for the TABLE named EMP owned by SCOTT"


Well, we'd like to have an index on OWNER for question 1.

On OBJECT_TYPE, OWNER for question 2.

On OBJECT_NAME, OBJECT_TYPE, OWNER for question 3.

Arguably - object_name is the most "selective" and either owner or object_type the least selective.

However, the index that makes the most sense?

(owner,object_type,object_name) - it can be used for ALL THREE.


Now, say I ask the question - "show me all object_id's greater than :X for tables"

where object_id > :x and object_type = 'TABLE'

object_id is certainly "unique", very selective, object_type is not selective at all.

But I'd want an index on (OBJECT_TYPE,OBJECT_ID) - not the other way around. I want to goto TABLES in the index, find the first OBJECT_ID for a table > :x, and range scan from there (everything - EVERYTHING I hit in the range scan is "data I want").

If we indexed (object_id,object_type) - we'll, we would range scan every OBJECT_ID > :x and then have to ask each index entry "are you a table???" we'd process ton of index entries we didn't care about.




As for the clustering factor - if the cluster factor is near the number of blocks in the table - that indicates the "data in the table is sorted by the index key". If the cluster factor is near the number of ROWS in the table - that indicates "the data in the table is not stored sorted by the index key"


Remember - data can be sorted by how many keys? Pretty much the answer is "1" - I would be surprised if the data in the table was sorted by

A,B

AND

B,A

In the case of a primary key - say your primary key is (bad key, just for ILLUSTRATION, lets not discuss the pros/cons of using timestamp as part of a key :)

CUSTOMER_ID, TRANSACTION_TIMESTAMP


Now, the data tends to get inserted into this table by transaction_timestamp. customer_ids however arrive randomly. If you did a "select *" from this table, you would likely see the data coming out mostly sorted by timestamp (not entirely, but mostly)

Hence, an index on (transaction_timestamp,customer_id) would have a "very good clustering factor" - becaue the index key and the table data are very much "in the same order"

An index on (customer_id,transaction_timestamp) on the other hand would have a "very bad clustering factor" since the data is not stored sorted by customer_id in the table.

But I used AIR QUOTES on purpose. The clustering factor is just a number - if your most important query was:

select *
from t
where customer_id = :x
and transaction_timestamp >= sysdate-7


I would sure want the index to be on (customer_id,transaction_timestamp)!!!! There is no way I'd want it the other way around.

Now, we could get into a discussion on "how to keep the data organized by customer_id" and so on - but I'll point you to a book for that.

This discussion and most about physical structures is part of my old book "Expert one on one Oracle" (the index discussion, clustering of data and so on) as well as the updated version for 9i/10g in Expert Oracle: Database Architecture.

You saved my skin.....

A reader, December 24, 2005 - 1:30 am UTC

Not to mention other parts of my body.
I got into an argument with a colleague of mine on this issue of indexes. I told him that the order of columns should be based on the number of queries rather than cardinality. But then.... he never believed me.
Then I thought lets get OMV's opinion.
OMV=> Oracle Master's Voice ;)

And, am I glad to say, BOY was i Right.
Need to get him to read this answer. Thanks as always..



Happy New Year!

Mark, December 29, 2005 - 11:06 am UTC

Hi Tom,

Hope you had a great Holiday and an upcoming New Year!

Thanks for all your help and it looks like we are FINALLY moving to 9i!

Whahoo!

:)


In (select clause) vs in (v1,v2,v3...v1000)

A reader, November 21, 2007 - 4:37 am UTC


Hi Tom,
Is there a difference between the appraoch if we use in with a select clause and in with a variable list.
We have a query on a small table 30ok rows. It uses a in list on 1000 values. it is like:
select * from tab
where col_x in (v1,v2....v1000)
order by col_y.

The table has 5 columns. The columns x and y have been indexed in Ind_1 (x,y) and also Ind_2 (y,x). The records fetched by query is very small, say max. 800 out of 300K records. The stats are the latest and yet the table does not use any of the indexes and goes for full table scan.

But when I give the index (ind_1) as a hint. it returns the rows very fast.

I know full table scans are not bad, but in this case I want it to use the index. As with the index being used it fetches in 29 secs as compared to 1-2 mins.

Can you please advise?

Thanks as always.


NOT in and <>, !=

A reader, March 13, 2009 - 12:46 am UTC


Hi Tom,

What should be the strategy for indexing columns which are referenced in the where clause using "NOT IN", "<>", "!="

Most of the time I have seen it is best to avoid the columns from being indexed, but other experiences may vary.

Would like to know your opinion/suggestions on the same.

Thanks.

Tom Kyte
March 13, 2009 - 7:31 am UTC

it would be highly unusual to use an index for those operations - you would typically not index them for that particular operation.

think about it - how would you foresee an index being useful in those cases?

NOT in and <>, !=

A reader, March 13, 2009 - 11:09 pm UTC


Thanks for the updates. Exactly what I had in mind. But I think, my question was incorrect. To rephrase it:

In a where clause out of say: 6 clauses 2 use either of NOT in and <>, !=

As per the data being fetched it is best use the index for retreiving the data.

So keeping that in mind, we should index all the required colums except the ones referred in the NOT in and <>, != clause.

Please correct me if I am wrong.

Thanks.

Tom Kyte
March 16, 2009 - 9:12 am UTC

It is not entirely clear what you are saying, but I think you are saying "if we have a where clause with six AND'ed predicates, and 4 of them use things we 'conventionally think - we should index' and two of them use these NOT IN or whatever constructs, the index should only contain the four fields".


Maybe, maybe not. "IT DEPENDS" (one of two answers to any technical question, the other being "why?")


select a, b, c, d, x, y, z
from t
where a = 5 and b = 10 and c = 15 and d = 20 and x not in (1,2) and y <> 42;


would it make sense to index (a,b,c,d,x,y) - sure, it could, it might even make sense to index (a,b,c,d,x,y,z)


The reason - to avoid hitting the table when not necessary (a,b,c,d,x,y) can do that. If there are 1,000,000 records out of 100,000,000 where "a = 5 and b = 10 and c = 15 and d = 20" is true, but out of that 1,000,000 only 2 are such that "x not in (1,2) and y <> 42" is true - we might want to discover that before hitting the table 1,000,000 times. We can answer the where clause against the INDEX instead of banging on the table 1,000,000 times to see what x and y are.

And if you add Z, we can avoid the table entirely (wouldn't make sense in my example necessarily because we only visit it twice, but if out of the 1,000,000 times - we found 500,000 hits for the x/y predicate - we might consider adding Z to the index as well)

See the progression of plans in the following:


 autotrace traceonly explain
ops$tkyte%ORA10GR2> create index t_idx on t(a,b,c,d);

Index created.

ops$tkyte%ORA10GR2> select a,b,c,d,x,y,z from t where a = 5 and b = 10 and c = 15 and d = 20 and x not in (1,2) and y <> 42;

Execution Plan
----------------------------------------------------------
Plan hash value: 470836197

------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU
------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |     1 |    91 |     1   (0
|*  1 |  TABLE ACCESS BY INDEX ROWID| T     |     1 |    91 |     1   (0
|*  2 |   INDEX RANGE SCAN          | T_IDX |     1 |       |     1   (0
------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("X"<>1 AND "X"<>2 AND "Y"<>42)
   2 - access("A"=5 AND "B"=10 AND "C"=15 AND "D"=20)

<b>Hit index to evaluate a,b,c,d - go to table to evaluate x,y and pick up z</b>

ops$tkyte%ORA10GR2> drop index t_idx;

Index dropped.

ops$tkyte%ORA10GR2> create index t_idx on t(a,b,c,d,x,y);

Index created.

ops$tkyte%ORA10GR2> select a,b,c,d,x,y,z from t where a = 5 and b = 10 and c = 15 and d = 20 and x not in (1,2) and y <> 42;

Execution Plan
----------------------------------------------------------
Plan hash value: 470836197

------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU
------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |     1 |    91 |     0   (0
|   1 |  TABLE ACCESS BY INDEX ROWID| T     |     1 |    91 |     0   (0
|*  2 |   INDEX RANGE SCAN          | T_IDX |     1 |       |     0   (0
------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("A"=5 AND "B"=10 AND "C"=15 AND "D"=20)
       filter("X"<>1 AND "X"<>2 AND "Y"<>42)

<b>hit index to evaluate a,b,c,d and then filter on x,y and THEN hit table to pick up Z</b>

ops$tkyte%ORA10GR2> drop index t_idx;

Index dropped.

ops$tkyte%ORA10GR2> create index t_idx on t(a,b,c,d,x,y,z);

Index created.

ops$tkyte%ORA10GR2> select a,b,c,d,x,y,z from t where a = 5 and b = 10 and c = 15 and d = 20 and x not in (1,2) and y <> 42;

Execution Plan
----------------------------------------------------------
Plan hash value: 2946670127

------------------------------------------------------------------------
| Id  | Operation        | Name  | Rows  | Bytes | Cost (%CPU)| Time
------------------------------------------------------------------------
|   0 | SELECT STATEMENT |       |     1 |    91 |     0   (0)| 00:00:01
|*  1 |  INDEX RANGE SCAN| T_IDX |     1 |    91 |     0   (0)| 00:00:01
------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("A"=5 AND "B"=10 AND "C"=15 AND "D"=20)
       filter("X"<>1 AND "X"<>2 AND "Y"<>42)

<b>just hit the index, no table</b>

ops$tkyte%ORA10GR2> set autotrace off

Indexing strategy for table with more than 100 columns

Parthiban Nagarajan, October 20, 2009 - 6:16 am UTC

Hi Tom

Consider a table that stores STUDENT's detail. It has a column STUDENT_ID which is the primary key. That table has around 105 columns. And each and every column is indexed. But it does not mean no_of_columns = no_of_indexes. We do have this equation: no_of_columns < no_of_indexes (yes, we also have FBIs associated).

I am stunned to see this approach. Its been around 10 to 15 years (and I am very new here). What I guess is - to make search query (web based and hence dynamically built) faster, they would have chosen this.

Why people are thinking like RBO (index access is best) ? Could you advise on this ?
-- ---------------
-- Some more info
-- ---------------
-- This is an OLTP system
-- We have some more tables that satisfy: no_of_columns < no_of_indexes
-- But mostly like this: no_of_columns / 3 <= no_of_indexes
-- PK columns are not having good clustering_factor but DATE columns

Thanks in Advance ;)