
Question and Answer

Connor McDonald

Thanks for the question, John.

Asked: February 27, 2008 - 8:53 am UTC

Last updated: March 29, 2011 - 3:32 am UTC

Version: 10.1.0

Viewed 10K+ times

You Asked

Hi Tom,

The project I'm currently working on has the DB design rule that all tables must have a surrogate key. This is the case even if a perfectly good natural key exists.

The primary motivation appears to be that of join efficiency by removing the possibility of having to join two tables on more than one column.

Personally I'm not a fan of surrogate keys in general, or of this sort of blanket policy in particular. I believe that there is often much to be gained by inheriting the PK of the parent table into the child table as part of the PK of that table.

Also, when two tables are joined by a surrogate key there is an immediate loss of semantic information which would exist if natural keys were used instead. For example, what's the "real" relationship between the two tables and which of the "real" columns are involved in this relationship?

One thing that annoys me with designs like this is that the natural key column(s) from the parent table are often denormalised into the child table anyway for reasons of performance. So the only gain is usually the performance difference between joining on a single column and joining on multiple columns.

I've always thought this would be a marginal difference in most cases. Do you have any comment about this in general? Do you favour surrogate keys over natural keys? How tight would performance considerations have to be to justify such a scheme?

and Tom said...

...
The project I'm currently working on has the DB design rule that all tables must have a surrogate key. This is the case even if a perfectly good natural key exists.

......

Perfect example of why I despise the guise of "best practice", or ROT - rules of thumb. One person's 'best practice' is another person's nightmare.

....
The primary motivation appears to be that of join efficiency by removing the possibility of having to join two tables on more than one column.
......

hmmm, wonder if anyone ever tested that... It would not make that much difference - if any - in real life, especially in a transactional system. It might be true in some cases that the index on a single number would be "shorter" (fewer levels) than an index on a multi-part key - but that would be one extra logical IO. In most cases, I would say it would not be worth it (for the reasons you list: loss of information, loss of semantic meaning, loss of detail).

If you have a natural key, by all means - use it. The natural key should be immutable and sensible - they are sometimes "rare", but if I had a document table and a document version table - I would definitely use the document_id (might be a surrogate) as the primary key of one and (document_id, version#) as the primary key of the other (and document version might have an association to authors, so its primary key is a foreign key elsewhere in the system too)
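Tom's document example could be sketched in DDL like this (table and column names are illustrative, not from any real system; document_id and author_id may themselves be surrogates):

```sql
create table document (
  document_id number        not null,
  title       varchar2(200) not null,
  constraint document_pk primary key (document_id)
);

create table author (
  author_id   number       not null,
  author_name varchar2(80) not null,
  constraint author_pk primary key (author_id)
);

-- The child inherits the parent's key as part of its own natural key:
create table document_version (
  document_id  number not null,
  version#     number not null,
  created_date date   not null,
  constraint document_version_pk primary key (document_id, version#),
  constraint document_version_fk
    foreign key (document_id) references document (document_id)
);

-- ...and the grandchild inherits the child's key in turn, so the
-- association table needs no surrogate of its own:
create table document_version_to_author (
  document_id number not null,
  version#    number not null,
  author_id   number not null,
  constraint dvta_pk primary key (document_id, version#, author_id),
  constraint dvta_fk1 foreign key (document_id, version#)
    references document_version (document_id, version#),
  constraint dvta_fk2 foreign key (author_id)
    references author (author_id)
);
```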

http://asktom.oracle.com/pls/ask/search?p_string=%22natural+key%22

Rating

  (35 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Surrogate versus Natural Keys

Monty, February 27, 2008 - 1:52 pm UTC

So, if we create a table with a surrogate key; create a unique index on the natural key; (we want to enforce data integrity) use only the columns in the unique index to select data for DML; ... what was the reason for the surrogate key to begin with?
Tom Kyte
February 27, 2008 - 7:16 pm UTC

indeed, what was the reason?

RE: "what was the reason for the surrogate key to begin with?"

Duke Ganote, February 27, 2008 - 4:22 pm UTC

When you've got a data model with child tables that have child tables that are also grandchildren of another table, then your "business key" (there's little "natural" about most keys... ok, off my soap box) gets to be cumbersome. In other words, the "business key" encompasses many columns.

Also, suppose your so-called "immutable" business keys need to change. Why? Perhaps because your company buys another one and you need to 'fit in' other data. With a surrogate key, you can change your unique constraint "business key" to a function-based constraint based on some new columns or conditional data -- without revising the entire data structure of child tables.
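One way to read Duke's "change the unique constraint without revising the child tables" point in Oracle terms is the following sketch (all names hypothetical): the surrogate PK stays stable, while the business-key rule lives in a unique index that can be dropped and re-created when the rule changes.

```sql
create table customer (
  customer_id number       not null,
  cust_code   varchar2(20) not null,
  source_co   varchar2(10) not null,  -- e.g. which acquired company
  constraint customer_pk primary key (customer_id)
);

-- Original rule: cust_code is unique on its own.
create unique index customer_bk on customer (cust_code);

-- After an acquisition, the rule changes to (source_co, cust_code).
-- Only this index is rebuilt; no child table referencing customer_id
-- has to change.  (A function-based unique index could likewise
-- encode a conditional rule, as Duke suggests.)
drop index customer_bk;
create unique index customer_bk on customer (source_co, cust_code);
```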
Tom Kyte
February 27, 2008 - 7:22 pm UTC

tell me what to do then with:

document:

document_version:

document_version_to_author:

author:


I understand we might use a surrogate for document. And maybe for author.


but what about document version and document version to author please?




Why surrogate keys?

Mike, February 27, 2008 - 7:30 pm UTC

I think the question of 'natural versus artificial keys' is different from the question of 'should child tables use the FK to their parent(s) in their keys?'

In either event, I think the bigger issue (that is, bigger than cumbersome compound keys) is: what happens when a key needs to change?

With a true synthetic key, you should never have to change it - as the business should not know it and therefore can never have a need to say it has the 'wrong' value.

With a natural key, the need may arise to update it. This gets much worse if there are child tables, as those rows also have to be located and updated. Throw in some grandchildren and great-grandchildren, stir in a handful of non-deferred foreign-key constraints and you have a real undertaking there.
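The re-keying headache described above can be softened with deferrable foreign keys, which let parent and children be updated in one transaction and checked only at commit. A minimal sketch (schema hypothetical):

```sql
create table dept (
  deptno number       not null,
  dname  varchar2(30) not null,
  constraint dept_pk primary key (deptno)
);

create table emp (
  empno  number not null,
  deptno number not null,
  constraint emp_pk primary key (empno),
  constraint emp_fk foreign key (deptno) references dept (deptno)
    deferrable initially deferred
);

-- With the FK deferred, the child may briefly point at a nonexistent
-- parent inside the transaction; the constraint is enforced at commit:
update dept set deptno = 100 where deptno = 10;
update emp  set deptno = 100 where deptno = 10;
commit;
```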

Having said that, I prefer to use natural keys when they are available. But I have found more and more that the 'good, well-behaved' natural keys are rarely that - especially because there is usually some obscure case where they need to be updated. So I have found myself using artificial keys more and more often.
Tom Kyte
February 27, 2008 - 7:41 pm UTC

.. I think the question of 'natural versus artificial keys' is different from the
question of 'should child tables use the FK to their parent(s) in their keys?'
...

interesting, really? If you were in a shop that said:

"The project I'm currently working on has the DB design rule that all tables must have a surrogate key."

would you still say that? They didn't say "every table that isn't a child", they said "EVERY TABLE (eg: including the evil child table) must have a surrogate key."


So why change the question here?



... With a natural key, the need may arise to update it. ...

fail: take the document table with the 'sugar coated key of document id' - whatever that is, and the document version table.... do you need to update it?


that "we'll buy your company and we'll change your keys" does not fly with me either, that is a one time "we need to MOVE data from db1 to db3 (there is no db2 you see.....)", you obviously can rekey... You have to - there will be duplicates..

but tell me - how often does company A buy a company B that uses the SAME EXACT OFFERING OF SOFTWARE to even make this a consideration????? (I will guess: about ZERO, what do you guess??)

Let us stick with reality - an application, with keys that are provably immutable (the vast majority by the way...)

now what?

David Aldridge, February 27, 2008 - 8:09 pm UTC

Hmmm, what an interesting topic ...

Overuse of synthetic keys can of course require that queries include more joins in some circumstances, hence require more work to execute, in order to make a usable result set. Thus a subset of queries can end up paying a small tax for that design approach.

Underuse can require that some ddl runs enormously longer, as child records have to be modified, or you have to take a hit on development plus a maintenance outage to change a key from natural to synthetic.

You just have to "pick your poison".
Tom Kyte
February 28, 2008 - 3:29 pm UTC

as in "it depends" :)

Immutable Natural Keys

SeánMacGC, February 28, 2008 - 9:24 am UTC

One of the most inscrutable uses of surrogate keys, predominantly in data warehousing environments, is for DATE surrogate keys, where some folks deem it compulsory, apparently.

Not sure what the designers are fearing here, given that once a particular date, always a particular date, in all systems -- a reversion to the Julian Calendar perhaps? (That's not to say, of course, that an event didn't occur on a different date to that originally assigned.)
Tom Kyte
February 28, 2008 - 11:01 pm UTC

... (That's not to say, of course, that an event didn't occur on different date to that originally assigned) ..

right, but that would require an update regardless of the use of the surrogate - in fact, one might say a "harder update", you cannot just fix a date from 15-mar-2007 to 25-jun-2007 with an update if you used surrogates, you have to look up what magic number you assigned to 25-jun-2007 first.
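In SQL terms, the "harder update" Tom describes looks something like this (dimension and column names hypothetical):

```sql
-- Natural date key: one self-contained correction.
update sales_fact
   set sale_date = date '2007-06-25'
 where sale_date = date '2007-03-15';

-- Date surrogate key: first look up the "magic numbers" assigned to
-- both dates in the dimension table, then update.
update sales_fact
   set date_key = (select date_key
                     from date_dim
                    where cal_date = date '2007-06-25')
 where date_key = (select date_key
                     from date_dim
                    where cal_date = date '2007-03-15');
```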

My 2 cents, or 20 cents...

Bill, February 28, 2008 - 12:20 pm UTC

Just reviewed your other surrogate/natural topics on AskTom. Looks like we are in agreement: would be nice to mostly use surrogate keys, but sometimes it depends on the nature of the data, the business, the anticipated use, etc.

In the ideal world found only in my head, we would have surrogate keys on every entity. Makes for easier join query writing, automation and a few other niceties.

But reality frequently gets in the way.

In my first job as an overcharged but underpaid Andersen Consulting grunt, the tables we used for a large telco had no artificial keys. So every table had a 3 to 8 column natural key. Other than that, the model was fairly well normalized. So in our code, where we needed to join 5 to 8 tables together, we frequently ended up with 30 to 40 lines just dedicated to the predicate to get them hooked up. Since Andersen's unit testing standard demanded a separate test condition for every permutation of a where clause, including NULL and min/max values for each, I quickly became an undying fan of single column surrogate keys.

But I've seen it taken too far as well, creating surrogate keys for everything, including associative/cross-ref entities -- where they really didn't help anything, were never used and got in the way.

Couple more stories...

My last company's previous modeler insisted on using string-based short codes as the PK for most reference/lookup tables, mainly, they told me, to make querying easier for developers. They kept patting me on the head and telling me not to worry about meaningless PKs, that the codes would never change. 3 years later they wanted to change one of the codes, and found the update to the child tables would take 6 to 20 hours at every customer's site. Surrogate keys finally started to make sense to my technical architect who couldn't be convinced until that day came.

They sold a software system that was, for the most part, similar for all the participants in the energy industry. But there were enough variations that we had to maintain separate copies of the UI and data schema for the nine major markets. Underneath that, we had a common data model for the things that the nine markets shared. Some of these entities were partitioned and handled huge data volume.

Unfortunately the modeler that preceded me decided to use a multi-column, local natural key for these tables. Every once in a while, we would sell the product to a new customer that had new uniqueness requirements that didn't match the common model. So we would have to write a script that would re-create the enormous UK and roll it out to ALL the customers in ALL the markets to accommodate them. It meant hours of downtime for customers that shouldn't have been affected by a new customer across the country. I was able to convince them to at least start using market-specific UKs on the tables shared in common. Now I hear they may be factoring out the natural key into a master table, giving it a surrogate, and then using the surrogate as the PK for the partitioned beast. Should ease their woes.

So natural/business keys that change, can kill your business as well.

In the end, the judgment call for surrogate vs. natural depends and the circumstances. And that's where an experienced data architect/modeler is invaluable.
Tom Kyte
February 28, 2008 - 11:16 pm UTC

the only answer that always makes sense to any question is....


it depends :)

Love These Religious Wars.

Phil Singer, February 28, 2008 - 10:52 pm UTC

As has been said, an artificial key on every table is going to extremes. You end up with something like this:

From the Ed Sullivan Data Centre in Sioux City, Iowa:

Top Ten List

Today's Topic: Top 10 Reasons to have an Artificial Key on Every Table:

10. You don't ever have to look for natural keys again.

9. Name your key column 'PK', and all your join predicates can become just 'NATURAL JOIN'

8. You can have every other column in a row duplicated many times, and still claim to have a relational table.

7. A SELECT * in SQL*Plus looks just like what you get when you view it in MS Access.

6. CASE tools will give you this column automatically; 20% of your work is done with one click.

5. It makes it real easy to delete the duplicate rows you got because you didn't bother to put a unique index on the natural key.

4. You can astound your friends with how complicated your m<->m association tables are.

3. Once people learn of one of your artificial keys, it becomes a natural key, and you get to create another one.

2. The index on the artificial key on tables with a lot of updates/deletes is likely to become large, with a lot of sparse blocks in the tail end. This will give you a lot of indexes which could benefit from periodic rebuilding, thus justifying rebuilding all indexes on a weekly basis (which is probably another corporate standard).

And, the number one reason to have an Artificial Key on Every Table:

1. This is how SQL Server does it.
Tom Kyte
March 01, 2008 - 10:15 am UTC

very nice :)

I liked #3 and #1 in particular.

#9 would be a bug - as the PK in one table would be FK in the other and the natural join would do the wrongest of things...

Immutable Natural Keys

SeánMacGC, February 29, 2008 - 6:39 am UTC

Followup February 28, 2008 - 11pm US/Eastern:
... (That's not to say, of course, that an event didn't occur on different date to that originally assigned) ..

right, but that would require an update regardless of the use of the surrogate - in fact, one might say a "harder update", you cannot just fix a date from 15-mar-2007 to 25-jun-2007 with an update if you used surrogates, you have to look up what magic number you assigned to 25-jun-2007 first.


Exactly -- a totally pointless level of indirection and obfuscation.

refresh my memory...

Connor, March 05, 2008 - 8:06 am UTC

I was reading something in the last couple of weeks, where a built-in dbms package (that did some sort of table maintenance) could only accept one- or two-column primary keys as input. Anything more than 2 was not permitted...

Can't remember what it was ... Anyone ?

but that restriction may be an issue when deciding on natural vs surrogate


Surrogate key is just workaround

Zilvinas, July 04, 2008 - 4:57 am UTC

In theory there are no surrogates.
The business model does not have any surrogates.
RDBMSs are built following the rules of that theory.

Who are you trying to cheat with your surrogates? The business? The RDBMS creators? Yourself?

One of the main requirements of a database is data integrity.
When designers use surrogate keys they often lose data integrity. If they try not to, they end up with the same natural design, except that every table has one excess column and one excess index.

If you design your database following theory, you will find rare cases where you need the workaround (a surrogate key).

As for "never use business data for primary keys", I agree. Make IDs for your business "atom objects". But from that point on, use a natural design, as in the example with document versions.

As for "what if something happens that has a one-in-a-billion chance of happening" - that is just silly. There is a much bigger chance that your system will be obsolete long before it happens.

And one question. Many web developers say they absolutely need an id for every table. Are they not experienced enough, or do they use some crappy frameworks, or is it something else?

Oracle learner, December 09, 2008 - 9:48 am UTC

Hi Tom,

If I understand your argument correctly then if I have a table like the following:


ID Country_Code Country_Description
1  USA          United States of America
2  UK           United Kingdom
3  F            France
4  D            Germany


then I could dispense with the surrogate ID column and have a primary key on the 'Country_Code' column. However, since the Country_Description column is also unique should I add a unique constraint here to maintain data integrity, or am I missing the point?

Thanks

Tom Kyte
December 09, 2008 - 2:17 pm UTC

do you have a business rule that says the country_description must be unique (probably you do). If so, you would want to enforce that rule - yes.
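Concretely, the reader's table could be declared with both rules enforced and no surrogate ID column at all (a sketch; names taken from the example above):

```sql
create table country (
  country_code        varchar2(3)  not null,
  country_description varchar2(60) not null,
  -- the natural key serves as the primary key...
  constraint country_pk primary key (country_code),
  -- ...and the separate business rule gets its own unique constraint:
  constraint country_desc_uk unique (country_description)
);
```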



Kevin, April 21, 2009 - 11:41 am UTC

Hi Tom,

The main rationale I see for introducing surrogate keys is the overhead of large foreign keys. A common form of example is:
create table files (
file_name varchar2(100) not null,
file_date date not null,
constraint files_pk primary key (file_name))
/
create table transactions (
file_name varchar2(100) not null,
account varchar2(20) not null,
amount  number(15,2) not null,
constraint transactions_pk primary key (file_name,  account),
constraint transactions_fk1 foreign key (file_name) references files(file_name) on delete cascade)
/

Which could also be modeled w/ surrogates as:

create table files_with_surrogate (
file_id  number not null,
file_name varchar2(100) not null unique,
file_date date not null,
constraint files_ws_pk primary key (file_id))
/
create table transactions_with_surrogate (
file_id number not null,
account varchar2(20) not null,
amount  number(15,2) not null,
constraint transactions_ws_pk primary key (file_id,  account),
constraint transactions_ws_fk1 foreign key (file_id) references files_with_surrogate(file_id) on delete cascade)
/


Assume file_names average ~100 bytes in size, and there is an average of one thousand transactions per file_name:
 
INSERT INTO FILES 
SELECT RPAD('f',90,'f')||LEVEL, SYSDATE 
FROM DUAL CONNECT BY LEVEL < 201
/
INSERT INTO TRANSACTIONS
WITH T1 AS (SELECT LEVEL N FROM DUAL CONNECT BY LEVEL < 1001)
SELECT FILE_NAME,   
       RPAD('A',10,'A') || TRIM(TO_CHAR(N,'0000000000')) C2, 
       N / 100
FROM FILES  , T1
/
INSERT INTO FILES_WITH_SURROGATE 
SELECT LEVEL, RPAD('f',90,'f')||LEVEL, SYSDATE 
FROM DUAL CONNECT BY LEVEL < 201
/
INSERT INTO TRANSACTIONS_WITH_SURROGATE (FILE_ID, ACCOUNT, AMOUNT)
WITH T1 AS (SELECT LEVEL N FROM DUAL CONNECT BY LEVEL < 1001)
SELECT FILE_ID,   
       RPAD('A',10,'A') || TRIM(TO_CHAR(N,'0000000000')) C2, 
       N / 100
FROM FILES_WITH_SURROGATE, T1
/
COMMIT
/
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS ('APPS','FILES');
  DBMS_STATS.GATHER_TABLE_STATS ('APPS','TRANSACTIONS',NULL,10);
  DBMS_STATS.GATHER_TABLE_STATS ('APPS','FILES_WITH_SURROGATE');
  DBMS_STATS.GATHER_TABLE_STATS ('APPS','TRANSACTIONS_WITH_SURROGATE',NULL,10);
END;
/

This is a situation where the surrogate key approach shines; the following are much more expensive for the natural-key approach compared to the surrogate-key "file_id" solution:

1) Space needed for transactions (27 MB) vs transactions_with_surrogate (8 MB)

2) Space needed for transactions_pk (53 MB) vs transactions_ws_pk (13 MB)

3) Overall cost of joins, e.g.:
SELECT  F.FILE_DATE, SUM(AMOUNT) 
FROM    FILES F, 
        TRANSACTIONS T
WHERE   F.FILE_NAME = T.FILE_NAME
GROUP BY F.FILE_DATE


recursive calls 1
db block gets 0
consistent gets 3459
physical reads 0
redo size 0
bytes sent via SQL*Net to client 279
bytes received via SQL*Net from client 230
SQL*Net roundtrips to/from client 2
sorts (memory) 1
sorts (disk) 0

vs.

SELECT  F.FILE_DATE, SUM(AMOUNT) 
FROM    FILES_WITH_SURROGATE F, 
        TRANSACTIONS_WITH_SURROGATE T
WHERE   F.FILE_ID = T.FILE_ID
GROUP BY F.FILE_DATE


db block gets 0
consistent gets 949
physical reads 0
redo size 0
bytes sent via SQL*Net to client 279
bytes received via SQL*Net from client 256
SQL*Net roundtrips to/from client 2
sorts (memory) 1
sorts (disk) 0


Either of the values "bytes per filename" or "transactions per file" can be increased to further the degradation on points 1-3 of using the natural key.

Having said that, I dislike the idea of complicating a data model for performance, and consider it a "last resort". Is there an alternative? Given the declaration of the foreign key relationship, it seems that behind the scenes, Oracle could implement the natural data model by
1) maintaining ROWID pointers from transactions to files; and
2) use behind-the-scenes joins to resolve the transactions.file_name value when it is called for.

Put another way, when I run the natural key DDL for TRANSACTIONS above, Oracle could build these behind-the-scenes objects:
CREATE (hidden) TABLE RAW_TRANSACTIONS
(FILE_ROWID ROWID, ACCOUNT, AMOUNT,
CONSTRAINT P1 PRIMARY KEY (FILE_ROWID, ACCOUNT),
CONSTRAINT C1 FOREIGN KEY (FILE_ROWID) 
    REFERENCES FILES.ROWID)
/
CREATE (view-treated-as-visible-table) TABLE TRANSACTIONS
AS SELECT FILE_NAME, ACCOUNT, AMOUNT
FROM RAW_TRANSACTIONS, FILES
WHERE RAW_TRANSACTIONS.FILE_ROWID = FILES.ROWID
/


Then, the CBO would need the ability to rewrite this query:
SELECT  F.FILE_DATE, SUM(AMOUNT) 
FROM    FILES F, 
        TRANSACTIONS T
WHERE   F.FILE_NAME = T.FILE_NAME
GROUP BY F.FILE_DATE


as this:

SELECT  F.FILE_DATE, SUM(AMOUNT) 
FROM    FILES F, 
        RAW_TRANSACTIONS T
WHERE   F.ROWID = T.FILE_ROWID
GROUP BY F.FILE_DATE


(reminiscent of the fancy footwork done with COMPRESS tables)

I'm aware that using ROWID / memory pointer data at the logical level is intrinsically unstable, and I'd not dream of declaring a transactions table with a FILE_ROWID column. However, as long as Oracle was doing the bookkeeping on the physical pointers, could this work? More pointedly, is there any chance Oracle does or might offer such a feature?

-Kevin
Tom Kyte
April 21, 2009 - 3:48 pm UTC

In most cases, you would not be getting the entire set (think "emp and dept") so the consistent gets do not put me off so much (nor does the storage really).


How about this, what if in transactions you always needed to see the file_name.

when you take the natural key out, you HAVE to walk the surrogate back to the other table in order to get it (and many times you want it, so many times you end up joining by the surrogate in order to GET the natural key)

so, for every pro, there is a con, for every con there is a pro.

Funny thing life is, while I was reading this - I received an email (no joking, this really happened) and the subject was:

"use of surrogate keys"

Another (serious) downside to surrogate keys (in addition to being FORCED to walk back to the source table to pick up the natural key!) noted in that email to me (part of an ongoing discussion) was:


o Contention and I/O management

Massive insert rates usually require the primary key index to be maintained as part of the workings of the database. Using a sequence or encoded timestamp results in a right-growing index, which often becomes the single point of contention on a system.


definitely, that happens frequently. People start introducing things like reverse key indexes, hash partitioning - in efforts to reduce that.
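The two mitigations Tom mentions look like this in DDL (table and index names hypothetical; you would pick one or the other, not both, for a given key):

```sql
-- Option 1: a reverse key index stores the key bytes reversed, so
-- sequential values scatter across leaf blocks instead of piling up
-- on the rightmost one (at the cost of losing index range scans):
create unique index txn_pk_ix on transactions_hot (txn_id) reverse;

-- Option 2: a global hash-partitioned index splits the hot right
-- edge into several independent insertion points:
create unique index txn_pk_ix on transactions_hot (txn_id)
  global partition by hash (txn_id) partitions 8;
```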



And as stated in other articles (on asktom) and here, there is a tradeoff to be made. If you have a natural key that is truly 100 bytes on average (I would personally call that the edge case, the extreme case) then we might have something to discuss.

Surrogate keys as the rule - absolutely not.
Surrogate keys as something to consider in balance with everything else - quite so.





A reader, April 21, 2009 - 5:24 pm UTC

Hi Tom - appreciate the thoughts and insights.

A more direct question - in the situation where, after weighing pros/cons, it is deemed "best" to use a surrogate, would you think it useful in a future Oracle release to have a DDL keyword that tells Oracle, "link these tables over this foreign key constraint using a hidden surrogate key". The syntax might look something like:

create table transactions (
file_name varchar2(100) not null,
account varchar2(20) not null,
amount number(15,2) not null,
constraint transactions_pk primary key (file_name, account),
constraint transactions_fk1 foreign key (file_name) references files(file_name) USE SURROGATE KEY
)


When used, there is an understanding that all the performance pro's and con's of surrogate logic apply, *but* we let the database worry about the implementation details (including query rewrites for natural-key joins), and proceed with all application designs as though we're just using a natural key.

Or, is there a fundamental reason (some short-coming of the automated approach) that surrogate keys should always be explicitly implemented by the database designers, managed by the applications, and exposed in the data dictionary?


Tom Kyte
April 21, 2009 - 5:29 pm UTC

we do this with nested tables today - using a 16 byte raw populated with sys_guid()

I am not a fan.

I am not a fan of 'magic' and 'side effects'. I have no problem being painstakingly forward in my linear code, showing exactly what is happening (I am on record many times saying "do not use nested tables to persist data in the database; use them in plsql code, but never in DDL").


I don't have a problem with building a data model - explicitly, with great detail, "on purpose".
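The nested-table case Tom refers to can be seen in DDL (a sketch; type and storage names are illustrative):

```sql
create type transaction_t as object (
  account varchar2(20),
  amount  number(15,2)
);
/
create type transaction_tab as table of transaction_t;
/
create table files_nt (
  file_name varchar2(100) primary key,
  txns      transaction_tab
) nested table txns store as files_nt_txns;

-- Behind the scenes, Oracle links files_nt to the files_nt_txns
-- storage table through a hidden RAW(16) NESTED_TABLE_ID column,
-- populated via SYS_GUID() - exactly the kind of invisible
-- surrogate key Tom objects to.
```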

Java

Pat, April 22, 2009 - 3:25 pm UTC

I am working on a project where everything is done with Java frameworks and Hibernate has been selected to interface with the database.

I was told that ALL tables must have a surrogate key that is a VARCHAR2(32) because that is what Hibernate uses internally and it cannot use a multi-column PK.

So, ALL our tables now have a single column surrogate key and we have unique indices on the natural keys, to make sure Hibernate doesn't corrupt the data.

In my case, there was not even a case to discuss. Hibernate was selected and its way of working was imposed onto the DB people.

I simply hope that no other application will need to access this DB directly, since everything is now hidden inside Hibernate.

I was told that PL/SQL was a big NO because they wanted the code to be "database agnostic". Experience taught me there is no point arguing with the Java people because they are not willing to listen to anything that contradicts their gospel.

To finish up, here is a wonderful example of one of our important tables, the calendar that maps the fiscal calendar used by the clients. Instead of using the date as the PK, we have an anonymous serial number given by a sequence stored into a VARCHAR2(32). Simply brilliant! Thanks Hibernate.
Tom Kyte
April 24, 2009 - 4:33 pm UTC

... I was told that ALL tables must have a surrogate key that is a VARCHAR2(32)
because that is what Hibernate uses internally and it cannot use a multi-column
PK.
...

that is a lie and your developers should be banned from the database for that reason alone.

... to make sure Hibernate doesn't corrupt the data.
...

no no, hibernate doesn't corrupt data, that is what your developers do.


Don't blame hibernate actually, it is your developers - and why are they in charge? I'll never get that.

Don't get me wrong, I'm not a fan of hibernate, but only because of the company it attracts.

Lev, April 22, 2009 - 6:40 pm UTC

Pat,

I've been there and I know that "database agnostic" = disaster.
There is no way you can avoid it. If your application executes anything more complicated than select * from emp, it can't be database agnostic.

Hibernate is not that bad when used properly.
The problem is when you are using Hibernate you need to understand both
Hibernate AND database.

The biggest issue is that the above-mentioned disaster can happen late - when almost all development is done and you can't change anything. The earlier it happens, the better.

That's what I got from my (very painful) experience.

1) Insist that you need database constraints (PKs, FKs etc).

2) If 1) is not possible - create scripts verifying
data quality and include them in nightly build. Let managers see the results.

3) Check the SQL queries that are generated by Hibernate. Sometimes it generates 3 queries instead of one. Insist that developers fix that.

4) And the most important - don't allow developers to work on an empty database.
Populate the database with at least 50% of the real amount of data.
Java people like to wipe the database, recreate the database structure, insert 2-3 records in several tables and then execute unit tests. If there were no errors, they think they are on the road to success.
They call it TDD.
Find a way to execute their unit tests on a real amount of data at least once a day.
Usually management can understand the importance of testing the application in a real environment.

And lastly - you really need some good luck.
Believe me, very soon you will hear that the project needs another database (to be database agnostic) or, even worse, a new DBA.

So - good luck!
Lev

Tom Kyte
April 24, 2009 - 4:35 pm UTC

good points all.


and the only reason #1 would not be possible is if the developers actually allow the data to be inconsistent in between commits (eg: deferrable constraints should ALWAYS work, unless the developers allow the data to be inconsistent between commits - which will obviously lead to data inconsistencies and the need for that "data checker" you wrote of)

MYTH: "You only need to unit test"

Duke Ganote, April 25, 2009 - 4:15 pm UTC

If your developers say "you only need to unit test", they have no idea what TDD is.

http://pages.citebite.com/t1h4d1i7o6nwx

http://www.agiledata.org/essays/tdd.html

Follow up

Pat, May 06, 2009 - 9:17 pm UTC

A quick follow up on what I wrote earlier:

1. We now are giving the system to users for serious testing and, of course, they find bugs. Well, the problem now is that everything is in Hibernate and I have absolutely no clue what query gets generated. I know I could ask for the ASCII strings, but I am a bit afraid of what I would be given...

2. We also have heavy duty processing written in Java that is using Hibernate to access the DB. We have a pretty decent volume of data to crunch regularly and of course, the Java code is multi-threaded (as it should be.) However, now and then, we have deadlock errors that pop up, and I have no clue where they are coming from.

My hunch is that the Java code is relying on deferred commits from Hibernate and somehow the sets the threads process overlap. Since Hibernate returned "ok" when asked to commit WITHOUT actually committing on the DB side (to get better throughput, probably), it still holds locks for a little while. Then the thread receiving the "ok" from Hibernate moves on and, from time to time, picks up more data to process that overlaps with the rows still locked by Hibernate. If the thread is faster at doing its job than Hibernate is at sending the UPDATE statements to the DB and committing, Oracle will detect a deadlock.

This is just a hunch; I don't know for sure, because it is VERY difficult to reproduce, which, to me, indicates a race condition. I suspect they could turn Hibernate into a fully synchronous mode (Hibernate would not return OK until it has actually committed in the DB) and that would remove the deadlock problem, but I bet it would kill the performance. So they probably decided to use Hibernate in asynchronous mode (it returns ok even before it has committed the changes to the DB) and they are betting that the race condition won't happen too often.

3. Fortunately, we have FK's everywhere and we also have unique constraints on the natural keys, so we can make sure the data won't get corrupted.

4. This constraint of forcing an anonymous, single-column PK also has consequences for the schema design. When you have a table, a child table, and a grandchild table, and you only join them on their respective anonymous PKs, you can get into a situation where you lose information on the dimensional side of things. It then becomes next to impossible to write MERGE statements against the grandchild tables based on the actual dimensions that should have been inherited. You have to fall back to a row-by-row approach to enforce "by hand" the dimensional constraints that were lost along the way.
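A sketch of the problem in point 4, with hypothetical tables: once each level joins only on its own anonymous id, the grandchild no longer carries the dimension, so a MERGE keyed on that dimension has nothing to match on.

```sql
create table parent     (parent_id  number primary key,
                         country_id varchar2(2) not null);   -- the dimension
create table child      (child_id   number primary key,
                         parent_id  number references parent);
create table grandchild (gc_id      number primary key,
                         child_id   number references child);

-- country_id now lives two joins away from grandchild; a MERGE into
-- grandchild "by country" must first re-derive the column through
-- child and parent. With inherited keys, grandchild would carry
-- (parent_id, country_id, ...) and the MERGE could use it directly.
```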

I think a lot of Java developers have turned into religious fanatics who simply refuse to open their minds and want to be protected from the evil world of databases. :)

Had it been my call, I would have built a nice PL/SQL business API (not table based) that takes care of locking the rows I need to modify... But then it would not be "database agnostic", which is probably one of the biggest lies ever, along with "write once, run everywhere". But what do I know, I am just a DB guy. :)
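(For what it's worth, a PL/SQL API like the one described could keep concurrent workers off each other's rows with something along these lines — a sketch only, with a hypothetical `work_queue` table:)

```sql
declare
  -- each session claims only rows no other session has locked,
  -- so the sets processed by concurrent threads can never overlap
  cursor c is
    select rowid rid
      from work_queue
     where status = 'NEW'
       for update skip locked;
begin
  for r in c loop
    -- ... process the row here ...
    update work_queue set status = 'DONE' where rowid = r.rid;
  end loop;
  commit;  -- locks are released only when the work is really committed
end;
/
```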
Tom Kyte
May 11, 2009 - 11:27 am UTC

... We have a pretty decent volume of data to crunch regularly and of course, the Java code is multi-threaded (as it should be.) ...

I would disagree with that, the database is a pretty good 'parallel processing thing'. The data crunching should be single threaded with PARALLEL QUERY.
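A sketch of what that looks like — one session, one statement, and the database does the fan-out (table and column names are hypothetical):

```sql
-- Oracle runs the scan and aggregation with 8 parallel execution
-- servers; no application-level thread management needed
select /*+ parallel(f 8) */ region_id, sum(amount)
  from fact_sales f
 group by region_id;
```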

opinion: most developers are not capable of writing thread-safe applications. My reasoning: most are not capable of writing a trigger that is multi-user safe; they do not think about race conditions - what happens when 50 things happen at once.


and reading the rest of #2 confirms this opinion.



DB agnostic code

Pat, November 05, 2009 - 6:05 pm UTC

One of the key features "sold" by the Hibernate (and other ORM frameworks) users is that the code they write with it will magically become database agnostic.

In other words, the theory is that you could write an application with Hibernate with Oracle as a DB engine and then, decide to use MySQL or SQL Server instead. In theory, you would just need to recreate the tables, migrate the data and you don't need to change anything in the Java code.

I have to admit I have a hard time getting my head around this concept. Since Oracle implements "read consistency" everywhere all the time (readers don't block writers and vice versa), the Java code will have been built, tested, and debugged with that paradigm baked into the handling of transactions. If you move to another DB engine like MySQL for instance (which, as far as I know, does NOT implement read consistency), I don't understand how the exact same code will be able to generate exactly the same results all the time, especially when stressed by many concurrent users.

Does anybody have real world experience with swapping out the DB engine running under a Java code base using Hibernate? What were the results?
Tom Kyte
November 11, 2009 - 1:25 pm UTC

I've had a hibernate developer tell me:

that is ok, you just check the 'transactional' box and it'll fix all of your code.


I showed him a piece of logic that works provably in sqlserver - a bit of logic to pull changes from one system to another.

I showed him how that same logic would not work correctly in Oracle - using any isolation mode, the LOGIC had to change.

He said "hibernate would do that, just check transactional"

That is when I knew it was pointless to discuss; that developer had no idea how the database really worked or what "transactional" really meant - they were just told to check that box and all would be ok.



Hibernate-generated schema

C. W., January 29, 2010 - 11:26 am UTC

As a java developer (and a hibernate developer too), anyone who would suggest relying blindly on Hibernate to generate a good schema based only off of the object domain model is drinking the kool-aid. That's not the intent of the Hibernate people. From the book "Java Persistence with Hibernate" which is the revised edition of "Hibernate in Action" written by the Hibernate code authors, page 377: "you should always ask your DBA to finalize your database schema." And then, "We recommend that DBAs get involved early and verify the automatically generated DDL from Hibernate." And finally, "a professional DBA verifies and optimizes the SQL DDL and creates the final database schema."
Tom Kyte
January 29, 2010 - 4:24 pm UTC

wonder how many hibernate developers

a) read the book
b) follow it.


In my experience, the number is in the single digits, especially for (b).

Oleksandr Alesinskyy, January 30, 2010 - 4:10 pm UTC

wonder how many hibernate developers

a) read the book
b) follow it.

Quite a lot - it is a kind of "Hibernate bible".

As for the "Hibernate developer saying 'check a box'" from your previous comment - he was not a Hibernate developer at all, as Hibernate as such does not provide any user interface and so has no 'boxes to check'.
Tom Kyte
February 01, 2010 - 10:16 am UTC

Maybe you think they do (a).
I know they do not do (b).


to "check a box" is a saying, an idiom. We "check boxes" to signify "they are done"

they enabled hibernate transactions:
https://www.hibernate.org/42.html

and said "it is all good, we are transactional now", with a very big smile on their face.


Again, I haven't met the kind of hibernate developer you talk of. I meet the ones that say "hibernate does that for us, we don't need to know what the database is or does, we use transactional"


I'm reminded of "it's what plants crave" from a favorite movie of mine when talking to them

http://www.imdb.com/title/tt0387808/quotes#qt0427932

@Oleksandr

Enrique Aviles, January 30, 2010 - 5:58 pm UTC


"Quite a lot - it is a kind of "Hibernate bible"."

I bet they follow the book a lot for Hibernate specific information, not necessarily for the referenced comments related to DBAs and database schemas.

@Enrique Aviles

Oleksandr Alesinskyy, January 31, 2010 - 3:42 pm UTC

I bet they follow the book a lot for Hibernate specific information, not necessarily for the referenced comments related to DBAs and database schemas

It is hard to miss, as it is repeated (with minor variations) all over the book, several dozen times. Invest a couple of hours in skimming that book.

Alexander, February 01, 2010 - 11:52 am UTC

Tom,

Do you still have that example you used, the one that showed something that would work in sqlserver but not in oracle? Is it something you could share?
Tom Kyte
February 01, 2010 - 12:11 pm UTC

I have examples in every book I've written, but the one in this case is outlined here:

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:16998677475837

basically - it is a 'warehouse refresh', done in a database independent fashion.

It would "work" by having a trigger timestamp the time every record was last modified.

what would happen is, the refresh process would:

a) record what time it is RIGHT NOW into a table.
b) read out the last time it refreshed from that table (the prior (a) it had recorded)
c) pull every record that was modified since the time queried in (b) and apply it to the warehouse
d) commit

and do that over and over. Thinking that all changes would be pulled.
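Steps (a) through (d) might look like this in SQL — a sketch with hypothetical table and column names:

```sql
-- (a) record what time it is RIGHT NOW
insert into refresh_log (run_time) values (sysdate);
commit;

-- (b) read out the previous run's marker        -- => :last_run
select max(run_time) from refresh_log where run_time < sysdate;

-- (c) pull "everything modified since then" -- the flawed assumption:
--     a row stamped 8:59:59 by a still-open transaction is invisible here
select * from source_table where last_modified >= :last_run;

-- (d) apply to the warehouse, delete the old marker, commit
```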

Well, let's say that the systems were in sync for a minute (no transactions and we made sure both systems are IDENTICAL). So far, so good.

The last time we refreshed was 8am today, recorded in our table.
At 9am, we record in this table "9am" and read out 8am.
We pull everything modified since 8am and apply to warehouse
we delete the 8am record
we commit.

We are in sync, right? Well, no - because what I didn't tell you was that at 8:59:59am I started a transaction. The record would be timestamped 8:59:59am - but since your select started at 9am, before I committed (the two times don't have to be close; they could in fact be far apart!), your select would NOT see my modification. You would not block on the updated-but-uncommitted row, and even once it was committed you would not see the new value - read consistency and all. So your select would skip it.

And at 10am, you would read out 9am and miss it *again*, in fact you would miss it forever. Over time, the data would diverge in the two systems further and further. Spurious data inconsistencies would be reported by end users and then finally someone would look and find hundreds/thousands of inconsistencies. You (developer) would blame the database, sync the data and all would be well for a while - until it happens all over again.



In sqlserver - by default - your select at 9am would either hit the record before it was committed - get blocked and then read it - or it would hit it after I committed, and since they by default do not do consistent reads, you would just read whatever was there.


Same logic, different outcome.

John Gilmore, February 01, 2010 - 11:57 am UTC

...
I showed him a piece of logic that works provably in sqlserver - a bit of logic to pull changes from one system to another.
I showed him how that same logic would not work correctly in Oracle - using any isolation mode, the LOGIC had to change
...


I was just wondering if you could share the details of the logic with us, I'd be very interested to see it.
Tom Kyte
February 01, 2010 - 12:12 pm UTC

there are an infinite number of examples..... Just think of any multiuser situation with concurrent reads and writes of data.

Here is another one: reporting on the amount of money in the bank. What works very well in Oracle doesn't work at all in Sqlserver (by default)

see
https://www.oracle.com/technetwork/issue-archive/2005/05-nov/o65asktom-082389.html

natural keys

A reader, February 01, 2010 - 8:00 pm UTC


What is next?

A reader, March 03, 2010 - 2:39 am UTC

Hi Tom,

Okay we made the decision to use some generated keys - now the question is - do you see benefits of one over another for the keys below?

1) Surrogate keys (or synthetic keys or artificial keys)
2) Intelligent Keys (or smart keys)
where surrogate keys are just sequence numbers over the source natural key, source system id, and effective date (a model where we track changes), but an intelligent key is generated by concatenating the source natural key, the source system id, and the effective date (in the format YYYYMMDDHH24MISS)

Thanks


Tom Kyte
March 03, 2010 - 7:11 am UTC

I call those intelligent keys "so called intelligent"

I would stay as far away from them as possible....


Now, if your intelligent key isn't really an intelligent key but a NATURAL KEY - we'd be having a different discussion.

You use both terms - seemingly interchangeably. If you have an immutable primary key - a natural key - use it. If not, you'll need a surrogate.

You would NOT use concatenation - that would create a non-intelligent, dumb key.

You would use a compound primary key, the three attributes. That would be a smart, primary, natural key.
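In DDL terms, the compound natural key Tom describes is simply (table and column names hypothetical):

```sql
-- one key, three attributes; no concatenated "smart" string
alter table fact_history add constraint fact_history_pk
  primary key (source_natural_key, source_system_id, effective_date);
```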

Meaningless Key

A reader, March 05, 2010 - 2:43 am UTC

Hi Tom,

Whether it is a dumb key or a surrogate (meaningless) key, is there any benefit of one over the other? One benefit I can see is that with a surrogate key it could potentially occupy less storage, i.e., the size of the column can be controlled and made consistent throughout the model :-)

Thanks

Tom Kyte
March 05, 2010 - 5:45 am UTC

but, umm, you have to store the data don't you.

which means if you use a surrogate, your storage needs are:

a) size of natural key
PLUS
b) size of surrogate


whereas if you do not use surrogate, your storage in the table is

a) size of natural key

and you'd need to have two indexes in the first case (one for primary key, one for unique constraint on natural key).


So, it could potentially occupy MORE storage too.
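To see the arithmetic, compare two hypothetical designs:

```sql
-- with a surrogate: the natural key is stored anyway, PLUS the
-- surrogate, and TWO indexes (primary key + unique constraint)
create table t_surrogate (
  id      number       primary key,      -- surrogate + index 1
  nat_key varchar2(30) not null unique   -- natural key + index 2
);

-- without a surrogate: one column set, ONE index
create table t_natural (
  nat_key varchar2(30) primary key
);
```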


Some problems when using surrogate keys

Marco, December 10, 2010 - 6:31 am UTC

First, thank you very much for the interesting thread.

I find natural keys better than surrogate keys for the following reasons:

- The values of generated surrogate keys have no relationship to the real meaning of the data held in a row.

- Since surrogate keys replace identifying attributes and since the identifying attributes are likely to be those queried, then the query optimizer is forced to perform a full table scan when fulfilling likely queries.

- The presence of a surrogate key can result in the database administrator forgetting to establish, or accidentally removing, a secondary unique index on the natural key of the table. Without a unique index on the natural key, duplicate rows can appear and once present can be difficult to identify.

- Because surrogate keys are unnatural, flaws can appear when modeling the business requirements.

- Proprietary information can be leaked if sequential key generators are used.
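The third risk above — forgetting the unique index on the natural key — is cheap to guard against; a sketch with a hypothetical table:

```sql
-- surrogate PK plus a unique constraint on the natural key, so
-- duplicate rows cannot creep in even though the PK is meaningless
create table customers (
  customer_sk number       primary key,   -- surrogate
  tax_id      varchar2(20) not null,
  country_id  varchar2(2)  not null,
  constraint customers_nk unique (tax_id, country_id)
);
```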

However, here I'd like to explain a particular case where using surrogate keys also leads to data inconsistency. Please consider the following example:

create table providers
(
prov_id number(6) not null,
country_id varchar2(2) not null,
ins_date date,
created_by varchar2(10)
);

alter table providers add constraint providers_pk primary key (prov_id, country_id) using index;

create table services4country
(
serv_id number(4) not null,
country_id varchar2(2) not null,
ins_date date,
created_by varchar2(8),
description varchar2(80)
);

alter table services4country add constraint services4country_pk primary key (serv_id, country_id) using index;

create table emails_providers
(
prov_id number not null,
serv_id number(4) not null,
country_id varchar2(2) not null,
pwd varchar2(20) not null,
email varchar2(100)
);

alter table emails_providers add constraint emails_providers_pk primary key (prov_id, serv_id, country_id) using index ;
alter table emails_providers add constraint emails_providers_fk_01 foreign key (serv_id, country_id) references services4country (serv_id, country_id);
alter table emails_providers add constraint emails_providers_fk_02 foreign key (prov_id, country_id) references providers (prov_id, country_id);

If you decide to migrate the natural keys of these tables to surrogate keys, for example because of implementation simplicity with Hibernate, you will allow data inconsistency in table emails_providers, because the value of the column country_id could be different in the other two tables.
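To make the point concrete, here is a sketch of the surrogate-key version of this schema (the `*_sk` columns are hypothetical surrogates assumed to have been added to the two parent tables):

```sql
-- each parent now has a single-column surrogate key, and
-- emails_providers references both parents by surrogate only
create table emails_providers (
  prov_sk number not null references providers (prov_sk),
  serv_sk number not null references services4country (serv_sk),
  pwd     varchar2(20) not null,
  email   varchar2(100)
);
-- nothing here forces the provider's country_id to equal the
-- service's country_id: the compound FKs on (x_id, country_id)
-- that guaranteed it in the natural-key design are gone.
```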

Best regards;

Marco

DBAs need to calm down on Hibernate

A reader, March 28, 2011 - 10:35 am UTC

I see no reason to use a natural key. It just complicates DB design, and further, application design and all other queries. The last 6 projects I worked on, I used Hibernate to generate the schema design, yes with perfect indexes, and they run in production just fine. See this old but useful blog post:
http://www.techrepublic.com/article/the-great-primary-key-debate/1045050

DBAs think that Developers are dumb and they need to calm down.

DBAs did not create Hibernate or JPA. DBAs did not create MongoDb or MapReduce or Bigtable. DBAs did not create unit testing methodologies using dbunit and spring. They probably are scratching their heads with all the jargon because they DON'T learn new technologies. Why would they? SQL never changes. Sorry that you're losing your job but reality changes quickly just like technology.

I have perhaps worked with one really smart DBA in my entire career and that was seven years ago. It's frustrating to see DBA's come and tell developers how to name columns among other things when they don't understand what modern frameworks and agile practices can do.

In summary, there might be a good DBA and a bad developer, but most times, using a good ORM stack with best practices often eliminates the need for a DBA and their old failed approaches of db design. All low-level details that concern transaction isolation or DB-centric behaviour can very well be abstracted at application configuration.
Tom Kyte
March 29, 2011 - 3:32 am UTC

Developers need to calm down on using hibernate.
there we are even.

I'm a developer by the way, not a DBA and I share the trepidation of many a DBA over the "hibernate" thing.


I see no reason to use a natural key. It just complicates DB design, and further,
application design and all other queries.


I see no reason not to use a natural key when one presents itself. It greatly eases the database design and makes query generation simple. Now what? That is my experience. It seems different from yours.


DBAs think that Developers are dumb and they need to calm down.


Ummm, no they don't. They think "one size fits all" is dumb, they think "the application is the end goal" is dumb, they think "performance is the job of the DBA" is dumb. They don't think an entire class of people is dumb; they think people who think like that are dumb.

DBAs did not create Hibernate or JPA. - NO KIDDING :)

DBAs did not create MongoDb or MapReduce
or Bigtable. - guess what - DBAs did not create Oracle!!!! DBAs did not create the RDBMS theory or practice. Now what???? Your point is meaningless.

They probably are scratching their heads with all the jargon because
they DON'T learn new technologies.


Now, who is calling who dumb??

SQL never changes. Sorry
that you're losing your job but reality changes quickly just like technology.


Now I'll call you myopic, or dumb, or just "head in a hole in the ground". That was a pretty stupid statement. Perhaps you should read a thing called a new features guide. We put out one (big) with each release - in the last 10 years we've gone from 8.1.5, to 8.1.6, 8.1.7, 9iR1, 9iR2, 10gR1, 10gR2, 11gR1, 11gR2 - that would be nine major releases - each with, well, hundreds of new things. SQL doesn't change? I believe you are the one that hasn't kept up.

but most times,
using a good ORM stack with best practices often eliminates the need for a DBA
and their old failed approaches of db design.


old failed? huh, interesting, since MOST of the databases are in an RDBMS format, since your paycheck comes from one of them, since your credit card transactions are definitely in one, since your reservations for hotels, etc etc etc - are in them.

And DBAs do not really do the database design - that is a function of a data modeler. DBAs do *administration*; they also generally know how to use the database in the best way (not always, but most of the time).

And how would an ORM stack eliminate the need for a DBA - that my friend is a stupid statement.

You don't need installation, upgrade, maintenance? You don't need backup/recovery? You don't need monitoring? I think you don't get the bigger picture - at all.


To "A Reader"

Kevin, March 29, 2011 - 7:55 am UTC

I am going to attach a link that I discovered through Tom's blog. A developer who gets what the database is and is for - AND how an object stack on top of a database really is the wrong way to go about it.

Oh and he touches on determining keys as well ( just so this stays on topic :-) )

http://database-programmer.blogspot.com/

To A Reader

Bill S., March 29, 2011 - 9:14 am UTC

"DBAs think that Developers are dumb and they need to calm down. "

Wow, I'm in deep trouble. I spent 10 years as a developer and the last eight as a developer/DBA. I don't use Hibernate, I think tools that decide the best SQL for the job without understanding the job are dangerous.

So I guess I think I'm dumb.

Oh Really?

Galen Boyer, March 29, 2011 - 12:25 pm UTC

> I see no reason to use a natural key.

It's a discussion. I can see reasons for both sides of the coin.

> It just complicates db design

How exactly? Seems like someone said these words to you. Please give
an example of why natural keys are "more complicated" than surrogates.

> and further Application design

Here's the crux, isn't it. The application wants a single id to
represent the row of data it is looking at, i.e., the data that the
object represents. Why doesn't the application allow for the "id" to
have more than one attribute? Why doesn't the application allow for
the "id" to be a character?

> and all other queries.

How exactly? Seems like someone said these words to you. Please give
an example of why natural keys are "more complicated" than surrogates.


> The last 6 projects I worked on, I used Hibernate to
> generate the schema design, yes with perfect indexes

Exactly what is a "perfect" index?

> and they run in production just fine. See this old but useful blog post:
> http://www.techrepublic.com/article/the-great-primary-key-debate

Well, in the first paragraph we read "Technically, there is no right
or wrong to this debate, only very strong opinions". So, hm..., you
see "no reason", but the article you cite already says there are
reasons.

But then the article says, "The main purpose of a primary key is to
relate records to additional data stored in other tables". That is
not the main purpose of a primary key; sorry, it is just incorrect.
First things first: a primary key is just a key. The fact that
someone decided that every table must have a primary key is really
wrong, in all databases. But either way, it represents a relational
key. And a key is forced on you because you are building a relation.

I also see a little bit later that he writes, "The primary key must
be compact and contain the fewest possible attributes." Yet again,
where does relational theory say anything about this? That's just a
statement made as though it were fact, so he can make the surrogate
argument later.


He then does the "let's show a scenario where natural keys fall
down" routine: "Initially the name appears as a perfect candidate,
... therefore, you can't enter any of the employee's data until the
proper name is known." Hm... So let's enter the record with nothing
to identify it except a surrogate key, then save and move on. How do
we get that record back to work on it later? The record one is
entering wouldn't be in the customer table; it would be in some work
area. So his convenient example isn't an example. It's more an
example that points out the rule that were you to go with surrogates,
you would need another key.

WHICH, I WILL ALWAYS SAY IS A RULE. IF YOU GO WITH SURROGATES, THEN
YOU MUST HAVE A NATURAL KEY DEFINED AS WELL.

Further down, he yet again says, "Used correctly (to establish
relationships),". This guy, and you I would guess, probably believe
that a foreign key means a relation. It doesn't. A foreign key is a
constraint. A relation is the attributes within a row in a table.
Relational modeling is about forming that relation. Drawing the lines
between the tables isn't relational; drawing lines between tables in
a model means you are building constraints within your database. A
foreign key does not relate two tables together. It constrains the
set of possible values in an attribute of one table to the key of
the other table.

> DBAs think that Developers are dumb and they need to calm down.

DBAs think developers want to not think about the database, and in
most cases they are correct. Developers want to be database ignorant
because their language says the code they write should be database
independent.

> DBAs did not create Hibernate or JPA. DBAs did not create MongoDb or
> MapReduce or Bigtable. DBAs did not create unit testing methodologies
> using dbunit and spring. They probably are scratching their heads
> with all the jargon because they DON'T learn new technologies. Why
> would they? SQL never changes. Sorry that you're losing your job but
> reality changes quickly just like technology.

> I have perhaps worked with one really smart DBA in my entire career
> and that was seven years ago. It's frustrating to see DBA's come
> and tell developers how to name columns among other things when they
> don't understand what modern frameworks and agile practices can do.

Wow. Can't live by a standard set forth by someone else? What
production shops are you in?

> In summary, there might be a good DBA and a bad developer, but most
> times, using a good ORM stack with best practices often eliminates
> the need for a DBA

Yeah, eliminates. Sure.

> and their old failed approaches of db design. All low level details
> that concern Transaction isolation or db centric behaviour that can
> very well be abstracted at application configuration.

Look, I'm a hibernate fan. I'm a java developer fan. I love that
language. But I truly can't stand the developer statements you are
spouting. Java hasn't replaced the need for DBAs. Java has increased
the need for the application developer to be better at databases.
It's just too bad that developers think hibernate solves that for them.

While he is Hibernating, Spring this on him

Duke Ganote, March 30, 2011 - 3:50 pm UTC

"Database as a Fortress"
#23 of "97 Things Every Software Architect Should Know"
-----------------------------------
"A solid data model is one that guarantees security of today’s data, but also extensible for tomorrow’s. Guaranteeing security means being impervious to bugs that will – despite your best efforts – be pervasive in an ever-changing application layer. It means enforcing referential integrity. It means building in domain constraints wherever they are known. It means choosing appropriate keys that help you ensure your data’s referential integrity and constraint satisfaction. Being extensible for tomorrow means properly normalizing your data so that you can easily add architectural layers upon your data model later. It means not taking shortcuts.

"The database is the final gatekeeper of your precious data. The application layer which is, by design ephemeral, cannot be its own watchdog. For the database to keep proper guard, the data model must be designed to reject data that does not belong, and to prevent relationships that do not make sense. Keys, foreign key relationships, and domain constraints, when described in a schema, are succinct, easy to understand and verify, and are ultimately self-documenting."

http://softarch.97things.oreilly.com/wiki/index.php/Database_as_a_Fortress

The Database Is IT. Period. No, Exclamation Point!

Duke Ganote, March 31, 2011 - 1:41 pm UTC

Amusing to read Andrew Wulf's fatally-flawed conclusion that the performance fault was "the stupid database". In this case, DB2 on AS/400... but the errors are obvious!

Read carefully. He complains that one "issue facing the Java team was that the [original programming] tool managed its constraints internal to the tool, and this was not available to any outside application including DB/2."

Duh. Constraints should always be in a central location. And that ain't some app server or programming tool. The one true location for constraints visible to everyone and enforced on everyone is the database.

http://thecodist.com/article/wtf-it-stories-1-its-not-the-database-stupid

Just a small data modeler

data_modeler, July 07, 2016 - 8:32 pm UTC

Well, I am a data modeler, maybe not a famous one. I am not a DBA, I am not a developer, and I don't like or dislike Hibernate; it is just a tool, like Oracle is one. Is Hibernate perfect? No. All it tries to do is bring the Java world and the relational world closer together - not an easy task. Is the Oracle RDBMS perfect? No. There are bugs and logical failures in both tools. In my opinion, you can ignore everything said here about Hibernate. It does not matter, because this is not about Hibernate.

I just want to tell you my point of view when it comes to modeling a database, and in this case surrogate keys vs. natural keys. Can a database work well with natural keys? Yes. Can a database work well with surrogate keys? Yes. But my advice is: always use surrogate keys. And here are my reasons:

- Tom said, "If you have an immutable primary key - a natural key - use it. If not, you'll need a surrogate." Well, for me that means you always have to use surrogate keys, because immutable natural keys do not exist. The most you can say is "I am pretty sure they are immutable", but not 100%. As developers, we do not control the environment we have to model. The environment can and will change. And in a good data model, the most important point is control. With surrogate keys, I am in control of my primary keys and of all foreign keys, all the time, no matter whether the environment changes.
Marco from Spain said: "The values of generated surrogate keys have no relationship to the real meaning of the data held in a row." This is not a disadvantage; it is the biggest advantage a primary key can offer you.

- Can surrogate keys use more space? Yes. But it really depends on how complex the natural keys are, how many relationships exist, and how big the child tables are. So in some cases surrogate keys will use less space, even with unique constraints on the natural columns. And in the end, the problems caused by the extra space are not that huge compared with the problems natural keys can cause.

- Some will argue that with natural keys you need fewer joins in some cases. That is true, but the only reason it works is that you "hide" business data in the foreign keys - you have redundancy with natural keys in your database. I could do the same with non-primary-key columns, for example a unique not-null alternate-key column; that would save some joins as well. But it would still be redundancy.

- If you use natural keys, especially composite natural keys, it is possible that you won't be able to create a foreign key constraint. For example, you may have some kind of date range in your natural key, and foreign keys don't match on ranges. You won't have that kind of problem with surrogate keys.

- If you use natural keys, the risk is very high that you will have to use surrogate keys for some entity types as well. And I don't like to mix them; it makes things more complex.

- With surrogate keys you can establish better naming conventions, which gives you a better look and feel for your model - for example, PK_Tablename always matching FK_Tablename. A look at the column names is enough to know all the primary and foreign keys and which table is the parent of a foreign key.

- Let's see what properties a primary key should have:

-> The primary key must uniquely identify each record: 100% true for surrogate keys, but only most likely true for natural keys.
-> A record's primary-key value can't be null: true for both, I would say.
-> The primary key must remain stable - you can't change the primary key: 100% true for surrogate keys, but only most likely true for natural keys.
-> The primary key should be compact: 100% true for surrogate keys, but only perhaps for natural keys.
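The date-range point in the list above can be made concrete (hypothetical tables):

```sql
create table price_list (
  item_id    number,
  valid_from date,
  valid_to   date,
  primary key (item_id, valid_from)
);
-- an order line knows only item_id and order_date; the rule
-- "order_date between valid_from and valid_to" cannot be expressed
-- as a foreign key (FKs match column-for-column, never on ranges),
-- so the link must be enforced procedurally - or via the surrogate
-- of the matching price_list row, resolved at insert time.
```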

So if you compare the pros and cons of both kinds of keys, for me the surrogate key wins. And the biggest reason for me is the control, because you do not have any business data in your primary and foreign keys.

Just for the info: a data model can never guarantee the integrity of the data, not even with foreign keys. The most you can do is support data integrity, not more.

Just the small
data modeler