Skip to Main Content
  • Questions
  • Composite primary key on many columns

Breadcrumb

Question and Answer

Chris Saxon

Thanks for the question, Saurabh.

Asked: June 13, 2023 - 5:27 am UTC

Last updated: June 19, 2023 - 2:25 pm UTC

Version: 19C

Viewed 1000+ times

You Asked

I have a table where a combination of 8 varchar columns uniquely (case insensitively) identifies a record. There is no other column that can act as a primary key. Moreover, the table is refreshed fully every time from the feed which means we cannot add a sequence to act as a primary key. I was thinking to add one column which contains some kind of unique deterministic hash value for the 8 identifier columns and can act as a primary. Could you please let me know if this is possible?

and Chris said...

You can do this by:

- Creating a virtual column that
- upper/lowercases the columns concatenated together
- passes this to standard_hash
- Making this hash the primary key

create table t (
  c1 varchar2(10), c2 varchar2(10), c3 varchar2(10),
  data_hash as ( 
    standard_hash ( lower ( c1 ) || lower ( c2 ) || lower ( c3 ) )
  ) constraint pk primary key
);

insert into t ( c1, c2, c3 ) values ( 'TEST', 'TEST', 'TEST' );
insert into t ( c1, c2, c3 ) values ( 'test', 'test', 'test' );

--ORA-00001: unique constraint (CHRIS.PK) violated

select * from t;
/*
C1         C2         C3         DATA_HASH                               
---------- ---------- ---------- ----------------------------------------
TEST       TEST       TEST       0071877D20A65C02D9A1654F109B97DC61416D1A
*/


This uses SHA1 by default. As with all hashing algorithms, there is a tiny chance of collisions. If this concerns you, you could use SHA256 which reduces the chance further.

Rating

  (6 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Several problems with that approach

mathguy, June 13, 2023 - 1:54 pm UTC

The hash of NULL is non-NULL, so that approach would allow all column values (in the supposed PK) to be NULL without throwing an error.

Then: simple concatenation, without some separator that is guaranteed to be impossible in the column values, will prevent the insertion of validly distinct keys. For example: ('a', 'b', 'cde') is distinct from ('ab', 'cd', 'e') - but after concatenation they become the same and are flagged as duplicates.

Even if we use a separator like & - so that ('a', 'b', 'cde') becomes 'a&b&cde' - we still have the same problem, as long as & is possible in the actual column values. If we do use a separator, it should be a control character like chr(31), the "unit separator" control character.

Also, it's hard to see how concatenation, with or without separator, followed by hash, can enforce that every value in a key column is non-NULL - even if there may be some way to reject a row where all key values are NULL.

One possible solution is to create 8 virtual columns, on lower(c1), lower(c2) etc., and set the primary key on these columns. This avoids the collision concern too.
Chris Saxon
June 14, 2023 - 12:22 pm UTC

You can define the individual columns to be NOT NULL - no need to do anything fancy in the hashing process.

Good point on distinct rows having the same concatenated string; though I suspect like hash collisions this is highly unlikely in practice. The OP will have to assess whether this is an issue for them and come up with a suitable separator.

I suspect they're actively trying to avoid an 8 column primary key; this could be to keep the index small and/or make foreign keys more practical.

John, June 13, 2023 - 9:07 pm UTC

Maybe I'm missing something, but what is wrong with a multi column unique key in combination with an identity column as primary key?
Chris Saxon
June 14, 2023 - 12:35 pm UTC

the table is refreshed fully every time from the feed

So the process is DELETE/TRUNCATE + INSERT => new identity column values every load. Which means rekeying all child rows, etc.

It's possible the process could be rewritten as a merge to avoid this problem. Whether that's practical is something the OP would have to clarify.

A reader, June 14, 2023 - 5:01 am UTC

Thanks Chris for a quick response. I think I can take it from here.

Another approach - set collation to a case-insensitive one

mathguy, June 14, 2023 - 5:29 am UTC

If your Oracle version is at least 12.2 and the MAX_STRING_SIZE is set to EXTENDED (a requirement I never understood), then there is another possible solution: set the collation to a case-insensitive variant, such as BINARY_CI, either for the entire table (with the DEFAULT COLLATION clause) or for the 8 columns individually (with the COLLATE clause) if the other columns of the table should not be affected. Then the PK can be set for the 8 columns and the effect is exactly what you need.

This will have other effects too; case-insensitive comparison to values in these columns will be performed in WHERE clause and join conditions, foreign keys pointing to your table will be checked using case-insensitive comparison, etc. You need to decide if these other effects are desired or not; in most cases they should be, if your PK values should be viewed as case-insensitive.

Also, if your Oracle version is at least 12.2 but your MAX_STRING_SIZE is STANDARD, you can change that to EXTENDED but that is irreversible, and it has several consequences, not all of which are harmless. But if your MAX_STRING_SIZE is already set to EXTENDED for other reasons, this is definitely something you may want to consider.
Chris Saxon
June 14, 2023 - 12:32 pm UTC

If your Oracle version is at least 12.2 and the MAX_STRING_SIZE is set to EXTENDED (a requirement I never understood)

This is because the collation of text of length N could be much longer than N. Collations are RAW values which - with non-extended data types - have a limit of 2,000 bytes. Meaning you can't (safely) collate strings >= 2,000 bytes and some shorter than that too.

Allowing longer VARCHAR2 & RAW values gives much more leeway, though doesn't resolve the problem entirely.

https://blogs.oracle.com/sql/post/how-to-do-case-insensitive-and-accent-insensitive-search-in-oracle-database#long-strings

Unique key and Identity column

Saša Petković, June 14, 2023 - 11:22 am UTC

I can only agree with John who suggested that we have unique key with all 8 columns and one identity column acting as PK, to me by far best solution.
Chris Saxon
June 14, 2023 - 12:36 pm UTC

Perhaps - we have very limited information about the overall process. Building a hash of the columns could be the most practical solution available.

Reasons not to hash a concatenation of PK columns

mathguy, June 17, 2023 - 11:29 pm UTC

There are several reasons to dislike the approach of concatenating PK columns after applying LOWER and then hashing the result and using the hash as PK, in addition to the (remote) possibility of collisions - caused either by concatenation or by hashing. It's not clear that the OP has a "space" or "key size" concern, but even if he does, it may not be as serious as the downside of the concat + hash solution.

If we need a child table with a foreign key pointing to this table, we will need to concatenate and hash the FK columns and point the hash to the PK. Perhaps that isn't too much work, but it doesn't feel natural.

In any case, this creates a problem that alone should disqualify this approach, if the PK is indeed meant as "PK" (as in, needed for referential integrity). Namely, in a composite PK/FK arrangement, a FK tuple should be allowed (per SQL standard and Oracle definition) if the FK has one or more NULL values, and the remaining values match at least one PK tuple. For example, (101, 3000) as PK and (101, NULL) in the FK is allowed. There is no way to hack the "concatenate + hash" approach to force it to behave in this way.

Another issue: Suppose a query has a condition like ... where lower(COL1) = 'abc'. With the BINARY_CI collation on the column, or with a virtual column on lower(COL1) (as part of a PK on virtual columns), the runtime can reference the index that supports the PK. There is no way to do that with the PK defined as the hash of a concatenation.

One can probably think of other inconveniences of the concat + hash approach. If this business of case-insensitive collation is really that important to the OP, perhaps they should just bite the bullet and change MAX_STRING_SIZE (if it isn't EXTENDED already).
Chris Saxon
June 19, 2023 - 2:25 pm UTC

we will need to concatenate and hash the FK columns and point the hash to the PK

You mean duplicate the 8 columns in any child tables too? Why? The hash is now a surrogate key, you only need to reference that in the child tables.

For example, (101, 3000) as PK and (101, NULL) in the FK is allowed.

I think that's a weak argument. In my experience explaining this, most people are both surprised this is possible and actively want to avoid this behaviour. Particularly when they realize it means you can insert values that don't exist in the parent at all!

I've yet to see a real-world case where you want this. Typically with multi-column FKs you want a check constraint to ensure either they're all null or none are.

the runtime can reference the index that supports the PK

That's only sure to help if the leading column of the PK is in the join/where clause. Query on any other subset of the columns and you'll likely need to create indexes to match the predicates.

Granted if you have these queries you'll likely need to do this anyway whichever approach you take. The point is it's not a given that the 8-column constraint solves all indexing needs.

To be clear: using column-based collation and defining the 8-column PK would be my preferred approach.

Building a single-column hash of the column values and making this the PK does have advantages though. The big one being is the hash could be much smaller than the text in the input columns. If any of the columns store "long" text the hash could be notably faster for data access - particularly if there are child tables.

Whether this happens in practice will depend on the data.

More to Explore

Design

New to good database design? Check out Chris Saxon's full fundamentals class.