Surrogate versus Natural Keys
February 27, 2008 - 1pm Central time zone
Reviewer: Monty from St. Louis, MO USA
So, if we create a table with a surrogate key; create a unique index on the natural key; (we want
to enforce data integrity) use only the columns in the unique index to select data for DML; ...
what was the reason for the surrogate key to begin with?
Followup February 27, 2008 - 7pm Central time zone:
indeed, what was the reason?
RE: "what was the reason for the surrogate key to begin with?"
February 27, 2008 - 4pm Central time zone
Reviewer: Duke Ganote from Amelia, Ohio USA
When you've got a data model with child tables that have child tables that are also grandchildren
of another table, then your "business key" (there's little "natural" about most keys... ok, off my
soap box) gets to be cumbersome. In other words, the "business key" encompasses many columns.
Also, suppose your so-called "immutable" business keys need to change. Why? Perhaps because your
company buys another one and you need to 'fit in' other data. With a surrogate key, you can change
your unique constraint "business key" to a function-based constraint based on some new columns or
conditional data -- without revising the entire data structure of child tables.
Followup February 27, 2008 - 7pm Central time zone:
tell me what to do then with:
document:
document_version:
document_version_to_author:
author:
I understand we might use a surrogate for document. And maybe for author.
but what about document version and document version to author please?
Why surrogate keys?
February 27, 2008 - 7pm Central time zone
Reviewer: Mike from Cleveland, OH USA
I think the question of 'natural versus artificial keys' is different from the question of 'should
child tables use the FK to their parent(s) in their keys?'
In either event, I think the bigger issue (that is, bigger than cumbersome compund keys) is: what
happens when a key needs to change?
With a true synthetic key, you should never have to change it - as the business should not know it
and therefore can never have a need to say it has the 'wrong' value.
With a natural key, the need may arise to update it. This gets much worse if there are child
tables, as those rows also have to be located and updated. Throw in some grandchildren and
great-grandchildren, stir in a handful of non-deferred foreign-key constraints and you have a real
undertaking there.
Having said that, I prefer to use natural keys when they are available. But I have found more and
more that the 'good, well-behaved' natural keys are rarely that - especially because there is
usually some obscure case where they need to be updated. So I have found myself using artificial
keys more and more often.
Followup February 27, 2008 - 7pm Central time zone:
.. I think the question of 'natural versus artificial keys' is different from the
question of 'should child tables use the FK to their parent(s) in their keys?'
...
interesting, really? If you were in a shop that said:
"The project I'm currently working on has the DB design rule that all tables must have a surrogate key."
would you still say that? They didn't say "every table that isn't a child", they said "EVERY TABLE (eg: including the evil child table) must have a surrogate key.
So why change the question here?
... With a natural key, the need may arise to update it. ...
fail: take the document table with the 'sugar coated key of document id' - whatever that is, and the document version table.... do you need to update it?
that "we'll buy your company and we'll change your keys" does not fly with me either, that is a one time "we need to MOVE data from db1 to db3 (there is no db2 you see.....)", you obviously can rekey... You have to - there will be duplicates..
but tell me - how often does company A buy a company B that uses the SAME EXACT OFFERING OF SOFTWARE to even make this a consideration????? (I will guess: about ZERO, what do you guess??)
Let us stick with reality - an application, with keys that are provably immutable (the vast majority by the way...)
now what?

February 27, 2008 - 8pm Central time zone
Reviewer: David Aldridge from Fairfax, VA
Hmmm, what an interesting topic ...
Overuse of synthetic keys can of course require that queries include more joins in some
circumstances, hence require more work to execute, in order to make a usable result set. Thus a
subset of queries can end up paying a small tax for that design approach.
Underuse can require that some ddl runs enormously longer, as child records have to be modified, or
you have to take a hit on development plus a maintenance outage to change a key from natural to
synthetic.
You just have to "pick your poison".
Followup February 28, 2008 - 3pm Central time zone:
as in "it depends" :)
Immutable Natural Keys
February 28, 2008 - 9am Central time zone
Reviewer: SeánMacGC from Ireland
One of the most inscrutable uses of surrogate keys, predominantly in data warehousing environments, is for DATE surrogate keys, where some folks deem it compulsory, apparently.
Not sure what the designers are fearing here, given that once a particular date always a particular date, in all systems -- a reversion to the Julian Calendar perhaps? (That's not to say, of course, that an event didn't occur on different date to that originally assigned)
Followup February 28, 2008 - 11pm Central time zone:
... (That's not to say, of course, that an event didn't occur on different date to that originally assigned) ..
right, but that would require an update regardless of the use of the surrogate - in fact, one might say a "harder update", you cannot just fix a date from 15-mar-2007 to 25-jun-2007 with an update if you used surrogates, you have to look up what magic number you assigned to 25-jun-2007 first.
My 2 cents, or 20 cents...
February 28, 2008 - 12pm Central time zone
Reviewer: Bill from Cedar Hills, UT
Just reviewed your other surrogate/natural topics on AskTom. Looks like we are in agreement: would be nice to mostly use surrogate keys, but sometimes it depends on the nature of the data, the business, the anticipated use, etc.
In the ideal world found only in my head, we would have surrogate keys on every entity. Makes for easier join query writing, automation and a few other niceties.
But reality frequently gets in the way.
In my first job as an overcharged but underpaid Andersen Consulting grunt, the tables we used for a large telco had no artificial keys. So every table had a 3 to 8 column natural key. Other than that, the model was fairly well normalized. So in our code, where we needed to join 5 to 8 tables together, we frequently ended up with 30 to 40 lines just dedicated to the predicate to get them hooked up. Since Andersen's unit testing standard demanded a separate test condition for every permutation of a where clause, including NULL and mix/max values for each, I quickly became an undying fan of single column surrogate keys.
But I've seen it taken too far as well, creating surrogate keys for everything, including associative/cross-ref entities -- where they really didn't help anything, were never used and got in the way.
Couple more stories...
My last company's previous modeler insisted on using string-based short codes as the PK for most reference/lookup tables, mainly, they told me, to make querying easier for developers. They kept patting me on the head and telling me not to worry about meaningless PKs, that the codes would never change. 3 years later they wanted to change one of the codes, and found the update to the child tables would take 6 to 20 hours at every customer's site. Surrogate keys finally started to make sense to my technical architect who couldn't be convinced until that day came.
They sold a software system that was, for the most part, similar for all the participants in the energy industry. But there were enough variations that we had to maintain separate copies of the UI and data schema for the nine major markets. Underneath that, we had a common data model for the things that the nine markets shared. Some of these entities were partitioned and handled huge data volume.
Unfortunately the modeler that preceded me decided to use a multi-column, local natural key for these tables. Every once in a while, we would sell the product to a new customer that had new uniqueness requirements that didn't match the common model. So we would have to write a script that would re-create the enormous UK and roll it out to ALL the customers in ALL the markets to accommodate them. It meant hours of downtime for customers that shouldn't have been affected by a new customer across the country. I was able to convince them to at least start using market-specific UKs on the tables shared in common. Now I hear they may be factoring out the natural key into a master table, giving it a surrogate, and then using the surrogate as the PK for the partitioned beast. Should ease their woes.
So natural/business keys that change, can kill your business as well.
In the end, the judgment call for surrogate vs. natural depends and the circumstances. And that's where an experienced data architect/modeler is invaluable.
Followup February 28, 2008 - 11pm Central time zone:
the only answer that always makes sense to any question is....
it depends :)
Love These Religious Wars.
February 28, 2008 - 10pm Central time zone
Reviewer: Phil Singer from Beautiful Downtown Brighton
As has been said, an artificial key on every table is going to extremes. You end up with something
like this:
From the Ed Sullivan Data Centre in Souix City Iowa:
Top Ten List
Today's Topic: Top 10 Reasons to have an Artificial Key on Every Table:
10. You don't ever have to look for natural keys again.
9. Name your key column 'PK', and all your join predicates can become just 'NATURAL JOIN'
8. You can have every other column in a row duplicated many times, and still claim to have a
relational table.
7. A SELECT * in SQL*Plus looks just like what you get when you view it in MS Access.
6. CASE tools will give you this column automatically; 20% of your work is done with one click.
5. It makes it real easy to delete the duplicate rows you got because you didn't bother to put a
unique index on the natural key.
4. You can astound your friends with how complicated your m<->m association tables are.
3. Once people learn of one of your artificial keys, it becomes a natural key, and you get to
create another one.
2. The index on the artificial key on tables with a lot of updates/deletes is likely to become
large, with a lot of sparse blocks in the tail end. This will give you a lot of indexes which
could benefit from periodic rebuilding, thus justifying rebuilding all indexes on a weekly basis
(which is probably another corporate standard).
And, the number on reason to have an Artificial Key on Every Table:
1. This is how SQL Server does it.
Followup March 1, 2008 - 10am Central time zone:
very nice :)
I liked #3 and #1 in particular.
#9 would be a bug - as the PK in one table would be FK in the other and the natural join would do the wrongest of things...
Immutable Natural Keys
February 29, 2008 - 6am Central time zone
Reviewer: SeánMacGC from Ireland
Followup February 28, 2008 - 11pm US/Eastern:
... (That's not to say, of course, that an event didn't occur on different date to that originally assigned) ..
right, but that would require an update regardless of the use of the surrogate - in fact, one might say a "harder update", you cannot just fix a date from 15-mar-2007 to 25-jun-2007 with an update if you used surrogates, you have to look up what magic number you assigned to 25-jun-2007 first.
Exactly -- a totally pointless level of indirection and obfuscation.
refresh my memory...
March 5, 2008 - 8am Central time zone
Reviewer: Connor from Perth
I was reading something in the last couple of weeks, where a built-in dbms package (that did some
sort of table maintenance) could only except 1 or 2 column primary keys as input. Anything more
than 2 was not permitted...
Can't remember what it was ... Anyone ?
but that restriction may be an issue when deciding on natural vs surrogate
Surrogate key is just workaround
July 4, 2008 - 4am Central time zone
Reviewer: Zilvinas from Lithuania
In theory there is no any surrogates.
Business model does not have any surrogates.
RDBMS are made following the rules of theory.
What do you want to cheat with your surrogates? Business? RDBMS creators? Yourself?
One of main requirements of database is data integrity.
When designers use surrogate keys they often loose data integrity. If they will try not to then
they will end up with the same natural design, only that every table has one excess column and one
excess index.
If you design your database following theory, you will find rare cases where you need to use
workaround (surrogate key).
As for "never use business data for primary keys" I agree. Make ID's for you business "atom
objects". But from this point use natural design like in example with document versions.
As for "what if happens something what has a chance of happening of one billion'th" it's just
stupid. There's much bigger chance that you system will be obsolete way before it happens.
And one question. Many Web developers say they absolutely need id for every table. Are they not
experienced enough or they use some crappy frameworks or something else?

December 9, 2008 - 9am Central time zone
Reviewer: Oracle learner from Scotland
Hi Tom,
If I understand your argument correctly then if I have a table like the following:
ID Country_Code Country_Description
1 USA United States of America
2 UK United Kingdom
3 F France
4 D Germany
then I could dispense with the surrogate ID column and have a primary key on the 'Country_Code' column. However, since the Country_Description column is also unique should I add a unique constraint here to maintain data integrity, or am I missing the point?
Thanks
Followup December 9, 2008 - 2pm Central time zone:
do you have a business rule that says the country_description must be unique (probably you do). If so, you would want to enforce that rule - yes.

April 21, 2009 - 11am Central time zone
Reviewer: Kevin from St. Louis, MO, USA
Hi Tom,
The main rationale I see for introducing surrogate keys is the overhead of large foreign keys. A common form of example is:
create table files (
file_name varchar2(100) not null,
file_date date not null,
constraint files_pk primary key (file_name))
/
create table transactions (
file_name varchar2(100) not null,
account varchar2(20) not null,
amount number(15,2) not null,
constraint transactions_pk primary key (file_name, account),
constraint transactions_fk1 foreign key (file_name) references files(file_name) on delete cascade)
/
Which could also be modeled w/ surrogates as:
create table files_with_surrogate (
file_id number not null,
file_name varchar2(100) not null unique,
file_date date not null,
constraint files_ws_pk primary key (file_id))
/
create table transactions_with_surrogate (
file_id number not null,
account varchar2(20) not null,
amount number(15,2) not null,
constraint transactions_ws_pk primary key (file_id, account),
constraint transactions_ws_fk1 foreign key (file_id) references files_with_surrogate(file_id) on
delete cascade)
/
Assume file_names average ~100 bytes in size, and there are an average of 1 thousand transactions per file_name:
INSERT INTO FILES
SELECT RPAD('f',90,'f')||LEVEL, SYSDATE
FROM DUAL CONNECT BY LEVEL < 201
/
INSERT INTO TRANSACTIONS
WITH T1 AS (SELECT LEVEL N FROM DUAL CONNECT BY LEVEL < 1001)
SELECT FILE_NAME,
RPAD('A',10,'A') || TRIM(TO_CHAR(N,'0000000000')) C2,
N / 100
FROM FILES , T1
/
INSERT INTO FILES_WITH_SURROGATE
SELECT LEVEL, RPAD('f',90,'f')||LEVEL, SYSDATE
FROM DUAL CONNECT BY LEVEL < 201
/
INSERT INTO TRANSACTIONS_WITH_SURROGATE (FILE_ID, ACCOUNT, AMOUNT)
WITH T1 AS (SELECT LEVEL N FROM DUAL CONNECT BY LEVEL < 1001)
SELECT FILE_ID,
RPAD('A',10,'A') || TRIM(TO_CHAR(N,'0000000000')) C2,
N / 100
FROM FILES_WITH_SURROGATE, T1
/
COMMIT
/
BEGIN
DBMS_STATS.GATHER_TABLE_STATS ('APPS','FILES');
DBMS_STATS.GATHER_TABLE_STATS ('APPS','TRANSACTIONS',NULL,10);
DBMS_STATS.GATHER_TABLE_STATS ('APPS','FILES_WITH_SURROGATE');
DBMS_STATS.GATHER_TABLE_STATS ('APPS','TRANSACTIONS_WITH_SURROGATE',NULL,10);
END;
/
This is a situation where the surrogate key approach shines; the following are much more expensive for the natrual-key apporach compared to the surrogate-key "file_id" solution:
1) Space needed for transactions (27 Mb) vs transactions_with_surrogate (8 Mb)
2) Space needed for transactions_pk (53 Mb) vs transactions_ws_pk (13 Mb)
3) Overall cost of joins, e.g.:
SELECT F.FILE_DATE, SUM(AMOUNT)
FROM FILES F,
TRANSACTIONS T
WHERE F.FILE_NAME = T.FILE_NAME
GROUP BY F.FILE_DATE
recursive calls 1
db block gets 0
consistent gets 3459
physical reads 0
redo size 0
bytes sent via SQL*Net to client 279
bytes received via SQL*Net from client 230
SQL*Net roundtrips to/from client 2
sorts (memory) 1
sorts (disk) 0
vs.
SELECT F.FILE_DATE, SUM(AMOUNT)
FROM FILES_WITH_SURROGATE F,
TRANSACTIONS_WITH_SURROGATE T
WHERE F.FILE_ID = T.FILE_ID
GROUP BY F.FILE_DATE
db block gets 0
consistent gets 949
physical reads 0
redo size 0
bytes sent via SQL*Net to client 279
bytes received via SQL*Net from client 256
SQL*Net roundtrips to/from client 2
sorts (memory) 1
sorts (disk) 0
Either of the values "bytes per filename" or "transactions per file" can be increased to further the degradation on points 1-3 of using the natural key.
Having said that, I dislike the idea of complicating a data model for performance, and consider it a "last resort". Is there an alternative? Given the declaration of the foreign key relationship, it seems that behind the scenes, Oracle could implement the natural data model by
1) maintaining ROWID pointers from transactions to files; and
2) use behind-the-scenes joins to resolve the transactions.file_name value when it is called for.
Put another way, when I run the natural key DDL for TRANSACTSIONS above, Oracle could build this behind-the-scenes objects:
CREATE (hidden) TABLE RAW_TRANSACTIONS
(FILE_ROWID ROWID, ACCOUNT, AMOUNT,
CONSTRAINT P1 PRIMARY KEY (FILE_ROWID, ACCOUNT),
CONSTRAINT C1 FOREIGN KEY (FILE_ROWID)
REFERENCES FILES.ROWID)
/
CREATE (view-treated-as-visible-table) TABLE TRANSACTIONS
AS SELECT FILE_NAME, ACCOUNT, AMOUNT
FROM RAW_TRANSACTIONS, FILES
WHERE RAW_TRANSACTIONS.FILE_ROWID = FILES.ROWID
/
Then, the CBO would need the ability to rewrite this query:
SELECT F.FILE_DATE, SUM(AMOUNT)
FROM FILES F,
TRANSACTIONS T
WHERE F.FILE_NAME = T.FILE_NAME
GROUP BY F.FILE_DATE
as this:
SELECT F.FILE_DATE, SUM(AMOUNT)
FROM FILES F,
RAW_TRANSACTIONS T
WHERE F.ROWID = T.FILE_ROWID
GROUP BY F.FILE_DATE
(reminiscient of the fancy-footwork done with COMPRESS tables)
I'm aware that using ROWID / memory pointer data at the logical level is intrinsically unstable, and I'd not dream of declaring a transactions table with a FILE_ROWID column. However, as long as Oracle was doing the bookkeeping on the physical pointers, could this work? More pointedly, is there any chance Oracle does or might offer such a feature?
-Kevin
Followup April 21, 2009 - 3pm Central time zone:
In most cases, you would not be getting the entire set (think "emp and dept") so the consistent gets do not put me off so much (nor does the storage really).
How about this, what if in transactions you always needed to see the file_name.
when you take the natural key out, you HAVE to walk the surrogate back to the other table in order to get it (and many times you want it, so many times you end up joining by the surrogate in order to GET the natural key)
so, for every pro, there is a con, for every con there is a pro.
Funny thing life is, while I was reading this - I received an email (no joking, this really happened) and the subject was:
"use of surrogate keys"
Another (serious) downside to surrogate keys (in addition to being FORCED to walk back to the source table to pick up the natural key!) noted in that email to me (part of an ongoing discussion) was:
o Contention and I/O management
Massive insert rates usually require the primary key index to be maintained as part of the workings of the database. Using a sequence or encoded timestamp results in a right growing index and often becomes the single point of contention on a system.
definitely, that happens frequently. People start introducing things like reverse key indexes, hash partitioning - in efforts to reduce that.
And as stated in other articles (on asktom) and here, there is a tradeoff to be made. If you have a natural key that is truly 100 bytes on average (I would personally call that the edge case, the extreme case) then we might have something to discuss.
Surrogate keys as the rule - absolutely not.
Surrogate keys as something to consider in balance with everything else - quite so.

April 21, 2009 - 5pm Central time zone
Reviewer: A reader
Hi Tom - appreciate the thoughts and insights.
A more direct question - in the situation where, after weighing pros/cons, it is deemed "best" to
use a surrogate, would you think it useful in a future Oracle release to have a DDL keyword that
tells Oracle, "link these tables over this foreign key constraint using a hidden surrogate key".
The syntax might look something like:
create table transactions (
file_name varchar2(100) not null,
account varchar2(20) not null,
amount number(15,2) not null,
constraint transactions_pk primary key (file_name, account),
constraint transactions_fk1 foreign key (file_name) references files(file_name) <b>USE SURROGATE
KEY</b>
)
When used, there is an understanding that all the performance pro's and con's of surrogate logic
apply, *but* we let the database worry about the implementation details (including query rewrites
for natural-key joins), and proceed with all application designs as though we're just using a
natural key.
Or, is there a fundamental reason (some short-coming of the automated approach) that surrogate keys
should always be explicitly implemented by the database designers, managed by the applications, and
exposed in the data dictionary?
Followup April 21, 2009 - 5pm Central time zone:
we do this with nested tables today - using a 16 byte raw populated with sys_guid()
I am not a fan.
I am not a fan of 'magic' and 'side effects', I have no problem being painstakingly forward in my linear code showing exactly what is happening (I am on record many times saying "do not use nested tables to persist data in the database, use them in plsql code, but never in DDL)
I don't have a problem with building a data model - explicitly, with great detail, "on purpose".
Java
April 22, 2009 - 3pm Central time zone
Reviewer: Pat from Oakland, CA, USA
I am working on a project where everything is done with Java frameworks and Hibernate has been
selected to interface with the database.
I was told that ALL tables must have a surrogate key that is a VARCHAR2(32) because that is what
Hibernate uses internally and it cannot use a multi-column PK.
So, ALL our tables now have a single column surrogate key and we have unique indices on the natural
keys, to make sure Hibernate doesn't corrupt the data.
In my case, there was not even a case to discuss. Hibernate was selected and its way of working was
imposed onto the DB people.
I simply hope that no other application will need to access this DB directly, since everything is
now hidden inside Hibernate.
I was told that PL/SQL was a big NO because they wanted the code to be "database agnostic".
Experience taught me there is no point arguing with the Java people because they are not willing to
listen to anything that contradicts their gospel.
To finish up, here is a wonderful example of one of our important tables, the calendar that maps
the fiscal calendar used by the clients. Instead of using the date as the PK, we have an anonymous
serial number given by a sequence stored into a VARCHAR2(32). Simply brilliant! Thanks Hibernate.
Followup April 24, 2009 - 4pm Central time zone:
... I was told that ALL tables must have a surrogate key that is a VARCHAR2(32)
because that is what Hibernate uses internally and it cannot use a multi-column
PK.
...
that is a lie and your developers should be banned from the database for that reason alone.
... to make sure Hibernate doesn't corrupt the data.
...
no no, hibernate doesn't corrupt data, that is what your developers do.
Don't blame hibernate actually, it is your developers - and why are they in charge? I'll never get that.
Don't get me wrong, I'm not a fan of hibernate, but only because of the company it attracts.

April 22, 2009 - 6pm Central time zone
Reviewer: Lev from Calgary, Canada
Pat,
I've been there and I know that "database agnostic" = disaster.
There is no way you can avoid it. If your application
executes anything more complicated then select * from emp
it can't be database agnostic.
Hibernate is not that bad when used properly.
The problem is when you are using Hibernate you need to understand both
Hibernate AND database.
The biggest issue that above mentioned disaster could happen late - when almost all
development is done and you can't change anything.
The earlier it happens - the better.
That's what I got from my (very painful) experience.
1) Insist that you need database constraints (PKs, FKs etc).
2) If 1) is not possible - create scripts verifying
data quality and include them in nightly build. Let managers see the results.
3) Check sql ¿ queries that are generated by Hibernate. Sometimes it generates 3 queries instead
of one. Insist that developers should fix that.
4) And the most important - don't allow developers to work on empty database.
Populate database with at least 50% of real amount of data.
Java people like to wipe database, recreate database structure, insert 2-3 records in
several tables and than execute unittests. If there were no errors they think that they
are on the road to success.
They call it TDD.
Find a way how to execute their unittetsts on real amount of data at least once a day.
Usually management can understand importance of testing application in real environment.
And the last - you really need some good luck.
Believe me very soon you will hear that project needs another database (to be database agnostic) or
even worse - new DBA.
That's why good luck!
Lev
Followup April 24, 2009 - 4pm Central time zone:
good points all.
and the only reason #1 would not be possible is if the developers actually allow the data to be inconsistent in between commits (eg: deferrable constraints should ALWAYS work, unless the developers allow the data to be inconsistent between commits - which will obviously lead to data inconsistencies and the need for that "data checker" you wrote of)
MYTH: "You only need to unit test"
April 25, 2009 - 4pm Central time zone
Reviewer: Duke Ganote from Newtown, Ohio USA
If your developers say "you only need to unit test", they have no idea what TDD is.
http://pages.citebite.com/t1h4d1i7o6nwx
http://www.agiledata.org/essays/tdd.html
Follow up
May 6, 2009 - 9pm Central time zone
Reviewer: Pat from Oakland, CA
A quick follow up on what I wrote earlier:
1. We now are giving the system to users for serious testing and, of course, they find bugs. Well, the problem now is that everything is in Hibernate and I have absolutely no clue what query gets generated. I know I could ask for the ASCII strings, but I am a bit afraid of what I would be given...
2. We also have heavy duty processing written in Java that is using Hibernate to access the DB. We have a pretty decent volume of data to crunch regularly and of course, the Java code is multi-threaded (as it should be.) However, now and then, we have dead lock errors that pop up, and I have no clue where those are coming from. My hunch is that the Java code is relying on deferred commits from Hibernate and somehow, the sets the threads process overlap. Since Hibernate returned "ok" when asked to commit WITHOUT actually committing on the DB side (to get a better throughput probably), it still holds locks for a little while. Then, the thread receiving the "ok" from Hibernate moves on and, from time to time, is going to pick up more data to process that is overlapping with the rows being locked by Hibernate. If the thread is faster doing its job than Hibernate is doing its job before sending the UPDATE statements to the DB and committing, Oracle will detect a dead lock. This is just a hunch I have and I don't know for sure, because it is VERY difficult to reproduce, which, to me, indicates a race condition. I suspect they could turn Hibernate into a fully synchronous mode (Hibernate would not return OK until it has actually committed in the DB) and that would remove the dead lock problem, but I bet it would kill the performance. So, they probably decided to use Hibernate in asynchronous mode (it returns ok even before it committed the changes to the DB) and they are betting that the race condition won't happen too often.
3. Fortunately, we have FK's everywhere and we also have unique constraints on the natural keys, so we can make sure the data won't get corrupted.
4. This constraint of forcing an anonymous, single column PK also has consequences on the schema design. When you have a table, a child table and a grand child table, if you only join them with their respective anonymous PK, you can get yourself into a situation where you lose information on the dimensional side of things. This means that it becomes next to impossible to write MERGE queries on the grand children tables based on the actual dimensions that should be inherited. You now have to write a row by row approach to enforce "by hand" dimensional constraints that have been lost along the way.
I think a lot of Java developers have turned into religious fanatics who simply refuse to open their minds and want to be protected from the evil world of databases. :)
Had it been my call, I would have built a nice PL/SQL business API (not table based) that takes care of locking the rows I need to modify... But then, it would not be "database agnostic" which is probably one of the biggest lies ever, along with "write once run everywhere". But what do I know, I am just a DB guy. :)
Followup May 11, 2009 - 11am Central time zone:
... We have a pretty decent volume of data to crunch regularly and of course, the Java code is multi-threaded (as it should be.) ...
I would disagree with that, the database is a pretty good 'parallel processing thing'. The data crunching should be single threaded with PARALLEL QUERY.
opinion: most developers are not capable of writing thread safe applications, reasoning for my thinking that: most are not capable of writing a trigger that is multi-user safe, they do not think about race conditions - what happens when 50 thing happen at once.
and reading the rest of #2 confirms this opinion.
DB agnostic code
November 5, 2009 - 6pm Central time zone
Reviewer: Pat from Oakland, CA
One of the key features "sold" by the Hibernate (and other ORM frameworks) users is that the code
they write with it will magically become database agnostic.
In other words, the theory is that you could write an application with Hibernate with Oracle as a
DB engine and then, decide to use MySQL or SQL Server instead. In theory, you would just need to
recreate the tables, migrate the data and you don't need to change anything in the Java code.
I have to admit I have a hard time getting my head around this concept. Since Oracle implements
"read consistency" everywhere all the time (readers don't block writers and vice versa) the Java
code will have been built tested and debugged with that paradigm baked into the handling of
transactions. If you move to another DB engine like MySQL for instance (which, as far as I know,
does NOT implement read consistency), I don't understand how the exact same code will be able to
generate exactly the same results all the time, especially when stressed by many concurrent users.
Does anybody have real world experience with swapping out the DB engine running under a Java code
base using Hibernate? What were the results?
Followup November 11, 2009 - 1pm Central time zone:
I've had a hibernate developer tell me:
that is ok, you just check the 'transactional' box and it'll fix all of your code.
I showed him a piece of logic that works provably in sqlserver - a bit of logic to pull changes from one system to another.
I showed him how that same logic would not work correctly in Oracle - using any isolation mode, the LOGIC had to change.
He said "hibernate would do that, just check transactional"
That is when I knew it was pointless to discuss, that developer had no idea really how the database worked or what transactional meant really - they were just told to check that box and all would be ok.
Hibernate-generated schema
January 29, 2010 - 11am Central time zone
Reviewer: C. W. from Seattle, WA
As a java developer (and a hibernate developer too), anyone who would suggest relying blindly on
Hibernate to generate a good schema based only off of the object domain model is drinking the
kool-aid. That's not the intent of the Hibernate people. From the book "Java Persistence with
Hibernate" which is the revised edition of "Hibernate in Action" written by the Hibernate code
authors, page 377: "you should always ask your DBA to finalize your database schema." And then, "We
recommend that DBAs get involved early and verify the automatically generated DDL from Hibernate."
And finally, "a professional DBA verifies and optimizes the SQL DDL and creates the final database
schema."
Followup January 29, 2010 - 4pm Central time zone:
wonder how many hibernate developers
a) read the book
b) follow it.
In my experience, the number is in the single digits. especially for (b)

January 30, 2010 - 4pm Central time zone
Reviewer: Oleksandr Alesinskyy
wonder how many hibernate developers
a) read the book
b) follow it.
Quite a lot - it is a kind of "Hibernate bible".
As for "Hibernate developer saying 'check a box'" from your previous comment - he was not a Hibernate developer at all, as Hibernate as such does no provide any user interface and so has not 'boxes to check'.
Followup February 1, 2010 - 10am Central time zone:
Maybe you think they do (a).
I know they do not do (b).
to "check a box" is a saying, an idiom. We "check boxes" to signify "they are done"
they enabled hibernate transactions:
https://www.hibernate.org/42.html
and said "it is all good, we are transactional now", with a very big smile on their face.
Again, I haven't met the kind of hibernate developer you talk of. I meet the ones that say "hibernate does that for us, we don't need to know what the database is or does, we use transactional"
I'm reminded of "it's what plants crave" from a favorite movie of mine when talking to them
http://www.imdb.com/title/tt0387808/quotes#qt0427932
@Oleksandr
January 30, 2010 - 5pm Central time zone
Reviewer: Enrique Aviles from FL
"Quite a lot - it is a kind of "Hibernate bible"."
I bet they follow the book a lot for Hibernate specific information, not necessarily for the referenced comments related to DBAs and database schemas.
@Enrique Aviles
January 31, 2010 - 3pm Central time zone
Reviewer: Oleksandr Alesinskyy
I bet they follow the book a lot for Hibernate specific information, not necessarily for the referenced comments related to DBAs and database schemas
It is hard to miss as it repeated (with minor variations) all over the book, several dozen times. Invest a couple of hours in skimming over that book.

February 1, 2010 - 11am Central time zone
Reviewer: Alexander
Tom,
Do you still have that example you used, the one that showed something that would work in sqlserver
but not in oracle? Is it something you could share?
Followup February 1, 2010 - 12pm Central time zone:
I have examples in every book I've written, but the one in this case is outlined here:
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:16998677475837
basically - it is a 'warehouse refresh', done in a database independent fashion.
It would "work" by having a trigger timestamp the time every record was last modified.
what would happen is, the refresh process would:
a) record what time it is RIGHT NOW into a table.
b) read out the last time it refreshed from that table (the prior (a) it had recorded)
c) pull every record that was modified since the time queried in (b) and apply it to the warehouse
d) commit
and do that over and over. Thinking that all changes would be pulled.
Well, let's say that the systems were in sync for a minute (no transactions and we made sure both systems are IDENTICAL). So far, so good.
The last time we refreshed was 8am today, recorded in our table.
At 9am, we record in this table "9am" and read out 8am.
We pull everything modified since 8am and apply to warehouse
we delete the 8am record
we commit.
We are in sync right? Well, no - because what I didn't tell you was that at 8:59:59am - I started a transaction. The record would be timestamped with 8:59:59amam - but since your select started at 8am (before I committed) also (at about or near the same time, doesn't have to be precise, it could in fact be far apart!) - your select would NOT see my modification. You would not block on the updated and uncommitted row - and if it were committed - you would not see that value - read consistency and all. So, your select would skip it.
And at 10am, you would read out 9am and miss it *again*, in fact you would miss it forever. Over time, the data would diverge in the two systems further and further. Spurious data inconsistencies would be reported by end users and then finally someone would look and find hundreds/thousands of inconsistencies. You (developer) would blame the database, sync the data and all would be well for a while - until it happens all over again.
In sqlserver - by default - your select at 9am would either hit the record before it was committed - get blocked and then read it, or it would hit it after I commit and since they by default do not do a consistent read - you would just read whatever you saw there.
Same logic, different outcome.

February 1, 2010 - 11am Central time zone
Reviewer: John Gilmore from London, England
...
I showed him a piece of logic that works provably in sqlserver - a bit of logic to pull changes from one system to another.
I showed him how that same logic would not work correctly in Oracle - using any isolation mode, the LOGIC had to change
...
I was just wondering if you could share the details of the logic with us, I'd be very interested to see it.
Followup February 1, 2010 - 12pm Central time zone:
there are an infinite number of examples..... Just think of any multiuser situation with concurrent reads and writes of data.
Here is another one: reporting on the amount of money in the bank. What works very well in Oracle doesn't work at all in Sqlserver (by default)
see
http://www.oracle.com/technology/oramag/oracle/05-nov/o65asktom.html
natural keys
February 1, 2010 - 8pm Central time zone
Reviewer: A reader
What is next?
March 3, 2010 - 2am Central time zone
Reviewer: A reader
Hi Tom,
Okay we made the decision to use some generated keys - now the question is - do you see benefits of one over another for the keys below?
1) Surrogate keys (or synthetic keys or artificial keys)
2) Intelligent Keys (or smart keys)
where surrogate keys are just sequences over source natural key, source system id, and effective date (model where we track changes) but intelligent key is generated through combination of source natural key, sources system id, and effective (in the format YYYYMMDDHH24MISS) by concatenation
Thanks
Followup March 3, 2010 - 7am Central time zone:
I call those intelligent keys "so called intelligent"
I would stay as far away from them as possible....
Now, if your intelligent key isn't really an intelligent key but a NATURAL KEY - we'd be having a different discussion.
You use both terms - seemingly interchangeable. If you have an immutable primary key - a natural key - use it. If not, you'll need a surrogate.
You would NOT use concatenation - that would create a non-intelligent, dumb key.
You would use a compound primary key, the three attributes. That would be a smart, primary, natural key.
Meaningless Key
March 5, 2010 - 2am Central time zone
Reviewer: A reader
Hi Tom,
Whether it is dumb key or surrogate (meaningless key), is there any benefit of one over other? One benefit I can see is with surrogate key it could potentially occupy less storage i.e., the size of the column can be controlled and made consistent throughout the model:-)
Thanks
Followup March 5, 2010 - 5am Central time zone:
but, umm, you have to store the data don't you.
which means if you use a surrogate, your storage needs are:
a) size of natural key
PLUS
b) size of surrogate
whereas if you do not use surrogate, your storage in the table is
a) size of natural key
and you'd need to have two indexes in the first case (one for primary key, one for unique constraint on natural key).
So, it could potentially occupy MORE storage too.
Some problems when using subrrogate keys
December 10, 2010 - 6am Central time zone
Reviewer: Marco from Spain
First, thank you very much for the interesting thread.
I find natural keys better than subrrogate keys for the following reasons:
- The values of generated surrogate keys have no relationship to the real meaning of the data held
in a row.
- Since surrogate keys replace identifying attributes and since the identifying attributes are
likely to be those queried, then the query optimizer is forced to perform a full table scan when
fulfilling likely queries.
- The presence of a surrogate key can result in the database administrator forgetting to establish,
or accidentally removing, a secondary unique index on the natural key of the table. Without a
unique index on the natural key, duplicate rows can appear and once present can be difficult to
identify.
- Because surrogate keys are unnatural, flaws can appear when modeling the business requirements.
- Proprietary information can be leaked if sequential key generators are used.
However, here I'd like to explain a particular case where using subrrogate keys also involves in
data inconsistency. Please, consider the following example:
create table providers
(
prov_id number(6) not null,
country_id varchar2(2) not null,
ins_date date,
created_by varchar2(10)
);
alter table v add constraint providers_pk primary key (prov_id, country_id) using index ;
create table services4country
(
serv_id number(4) not null,
country_id varchar2(2) not null,
ins_date date,
created_by varchar2(8),
description varchar2(80)
);
alter table services4country add constraint services4country_pk primary key (serv_id, country_id)
using index;
create table emails_providers
(
prov_id number not null,
serv_id number(4) not null,
country_id varchar2(2) not null,
pwd varchar2(20) not null,
email varchar2(100)
);
alter table emails_providers add constraint emails_providers_pk primary key (prov_id, serv_id,
country_id) using index ;
alter table emails_providers add constraint emails_providers_fk_01 foreign key (serv_id,
country_id) references services4country (serv_id, country_id);
alter table emails_providers add constraint emails_providers_fk_02 foreign key (prov_id,
country_id) references proveedores_web (prov_id, country_id);
If you decide to migrate the natural keys of these tables to subrrogate keys, for example because
of implementation simplicity with Hibernate you will allow data inconsistency on table
emails_providers because the value of the field country_id could be diferent on the other two
tables.
Best regards;
Marco
DBAs need to calm down on Hibernate
March 28, 2011 - 10am Central time zone
Reviewer: A reader
I see no reason using a natural key. It just complicates db design and further Application design
and all other queries. The last 6 projects I worked on, I used Hibernate to generate the schema
design, yes with perfect indexes, and they run in production just fine. See this old but useful
blog post:
http://www.techrepublic.com/article/the-great-primary-key-debate/1045050
DBAs think that Developers are dumb and they need to calm down.
DBAs did noy create Hibernate or JPA. DBAs did not create MongoDb or MapReduce or Bigtable. DBAs
did not create unit testing methodoglies using dbunit and spring. They probably are scratching
their heads with all the jargon because they DON'T learn new technologies. Why would they? SQL
never changes. Sorry that you're losing your job but reality changes quickly just like technology.
I have perhaps worked with one really smart DBA in my entire career and that was seven years ago.
It's frustrating to see DBA's come and tell developers how to name columns among other things when
they don't understand what modern frameworks and agile practices can do.
In summary, there might be a good DBA and a bad developer, but most times, using a good ORM stack
with best practices often eliminates the need for a DBA and their old failed approaches of db
design. All low level details that concern Transaction isolation or db centric behaviour that can
very well be abstracted at application configuration.
Followup March 29, 2011 - 3am Central time zone:
Developers need to calm down on using hibernate.
there we are even.
I'm a developer by the way, not a DBA and I share the trepidation of many a DBA over the "hibernate" thing.
I see no reason using a natural key. It just complicates db design and further
Application design and all other queries.
I see no reason not to use a natural key when one presents itself. It eases greatly the database design and makes query generation simple. Now what? That is my experience. Seems different than yours.
DBAs think that Developers are dumb and they need to calm down.
Ummm, no they don't. They think a one size fits all, they think a "the application is the end goal", they think that "performance is the job of the DBA" is dumb. They don't think an entire class of people are dumb, they think people that think like that are dumb.
DBAs did noy create Hibernate or JPA. - NO KIDDING :)
DBAs did not create MongoDb or MapReduce
or Bigtable. - guess what - DBAs did not create Oracle !!!! DBAs did not create the RDBMS theory or practice. Now what???? Your point is meaningless.
They probably are scratching their heads with all the jargon because
they DON'T learn new technologies.
Now, who is calling who dumb??
SQL never changes. Sorry
that you're losing your job but reality changes quickly just like technology.
Now I'll call you myopic, or dumb, or just "head in a hole in the ground". That was a pretty stupid statement. Perhaps you should read a thing called a new features guide. We put out one (big) with each release - in the last 10 years we've gone from 8.1.5, to 8.1.6, 8.1.7, 9iR1, 9iR2, 10gR1, 10gR2, 11gR1, 11gR2 - that would be nine major releases - each with, well, hundreds of new things. SQL doesn't change? I believe you are the one that hasn't kept up.
but most times,
using a good ORM stack with best practices often eliminates the need for a DBA
and their old failed approaches of db design.
old failed? huh, interesting, since MOST of the databases are in an RDBMS format, since your paycheck comes from one of them, since your credit card transactions are definitely in one, since your reservations for hotels, etc etc etc - are in them.
And DBA's do not really do the database design, that is a function of a data modeler - DBA's do *administration*, they also know how to use the database in the best way in general (not always, but most of the time)
And how would an ORM stack eliminate the need for a DBA - that my friend is a stupid statement.
You don't need installation, upgrade, maintenance? You don't need backup/recovery? You don't need monitoring? I think you don't get the bigger picture - at all.
To "A Reader"
March 29, 2011 - 7am Central time zone
Reviewer: Kevin from South Carolina
I am going to attach a link that I discovered through Tom's blog. A developer who gets what the database is and is for - AND how that an object stack on top of a database really is the wrong way to go about it.
Oh and he touches on determining keys as well ( just so this stays on topic :-) )
http://database-programmer.blogspot.com/
To A Reader
March 29, 2011 - 9am Central time zone
Reviewer: Bill S. from New England, USA
"DBAs think that Developers are dumb and they need to calm down. "
Wow, I'm in deep trouble. I spent 10 years as a developer and the last eight as a developer/DBA.
I don't use Hibernate, I think tools that decide the best SQL for the job without understanding the
job are dangerous.
So I guess I think I'm dumb.
Oh Really?
March 29, 2011 - 12pm Central time zone
Reviewer: Galen Boyer from Boston
> I see no reason using a natural key.
Its a discussion. I can see reasons for both sides of the coin.
> It just complicates db design
How exactly? Seems like someone said these words to you. Please give
an example of why natural keys are "more complicated" than surrogates.
> and further Application design
Here's the crux, isn't. The application wants a single id to
represent the row of data it is looking at, ie, the data that the
object represents. Why doesn't the application allow for the "id" to
have more than one attribute? Why doesn't the application allow for
the "id" to be a character?
> and all other queries.
How exactly? Seems like someone said these words to you. Please give
an example of why natural keys are "more complicated" than surrogates.
> The last 6 projects I worked on, I used Hibernate to
> generate the schema design, yes with perfect indexes
Exactly what is a "perfect" index?
> and they run in production just fine. See this old but useful blog post:
> http://www.techrepublic.com/article/the-great-primary-key-debate
Well, in the first paragraph we read "Technically, there is no right
or wrong to this debate only very strong opinions". So, hm..., you
see "no reason" but the article you site already says there are
reasons.
But, then the article says, "The main purpose of a primary key is to
relate records to additional data stored in other tables". That is
not the main purpose of a primary key, sorry, it just is incorrect.
First things first, a primary key is just a key. The fact that
someone decided that a table must have a primary key is really wrong
in all databases. But, either way, it represents a relational key.
And a key is forced on you because you are building a relation.
I also see a little bit later he writes in the following, "The primary
key must be compact and contain the fewest possible attributes." Yet
again, where does relational theory say anything about this? That's
just a statement made as though it is fact so he can make the
surrogate argument later.
He then does the, lets show a scenario and why natural keys fall
down. "Initially the name appears as a perfect candidate,
... therefore, you can't enter any of the employee's data until the
proper name is known." Hm... So, lets enter the record with nothing
to identify it except a surrogate key. Then save and move on. How
do we get that record back to work on it later? That record one is
entering wouldn't be in the customer table, it would be in some work
area, so, his convenient example isn't an example. Its more an
example that points out the rule that were you to go with surrogates,
then you need another key.
WHICH, I WILL ALWAYS SAY IS A RULE. IF YOU GO WITH SURROGATES, THEN
YOU MUST HAVE A NATURAL KEY DEFINED AS WELL.
Further down, he yet again says, "Used correctly (to establish
relationships),". This guy, and you I would guess, probably believe
that a foreign key means a relation. It doesn't. A foriegn key is a
constraint. A relation is the attributes within a row in a table.
Relational modeling is about forming that relation. Drawing the lines
between the tables isn't relational. Drawing lines between building
the tables in a model means you are building constraints within your
database. A foreign key does not relate two tables together. It
constraint the set of possible values in an attribute of a table with
the key of the other table.
> DBAs think that Developers are dumb and they need to calm down.
DBAs think developers want to not think about the database, and in
most cases they are correct. Developers want to be database ignorant
because their language says the code they write should be database
independent.
> DBAs did noy create Hibernate or JPA. DBAs did not create MongoDb or
> MapReduce or Bigtable. DBAs did not create unit testing methodoglies
> using dbunit and spring. They probably are scratching their heads
> with all the jargon because they DON'T learn new technologies. Why
> would they? SQL never changes. Sorry that you're losing your job but
> reality changes quickly just like technology.
> I have perhaps worked with one really smart DBA in my entire career
> and that was seven years ago. It's frustrating to see DBA's come
> and tell developers how to name columns among other things when they
> don't understand what modern frameworks and agile practices can do.
Wow. Can't live by a standard set forth by someone else? What
production shops are you in?
> In summary, there might be a good DBA and a bad developer, but most
> times, using a good ORM stack with best practices often eliminates
> the need for a DBA
Yeah, eliminates. Sure.
> and their old failed approaches of db design. All low level details
> that concern Transaction isolation or db centric behaviour that can
> very well be abstracted at application configuration.
Look, I'm a hibernate fan. I'm a java developer fan. I love that
language. But, I truly can't stand developer statements that you are
spouting. Java hasn't replaced the need for dbas. Java has increased
the need for that application developer to be better at databases.
Its just too bad that developers think hibernate solves that for them.
While he is Hibernating, Spring this on him
March 30, 2011 - 3pm Central time zone
Reviewer: Duke Ganote from 39.08,-84.33
"Database as a Fortress"
#23 of "97 Things Every Software Architect Should Know"
-----------------------------------
"A solid data model is one that guarantees security of today’s data, but also extensible for
tomorrow’s. Guaranteeing security means being impervious to bugs that will – despite your best
efforts – be pervasive in an ever-changing application layer. It means enforcing referential
integrity. It means building in domain constraints wherever they are known. It means choosing
appropriate keys that help you ensure your data’s referential integrity and constraint
satisfaction. Being extensible for tomorrow means properly normalizing your data so that you can
easily add architectural layers upon your data model later. It means not taking shortcuts.
"The database is the final gatekeeper of your precious data. The application layer which is, by
design ephemeral, cannot be its own watchdog. For the database to keep proper guard, the data model
must be designed to reject data that does not belong, and to prevent relationships that do not make
sense. Keys, foreign key relationships, and domain constraints, when described in a schema, are
succinct, easy to understand and verify, and are ultimately self-documenting."
http://softarch.97things.oreilly.com/wiki/index.php/Database_as_a_Fortress
The Database Is IT. Period. No, Exclamation Point!
March 31, 2011 - 1pm Central time zone
Reviewer: Duke Ganote from 39.08,-84.33
Amusing to read Andrew Wulf's fatally-flawed conclusion that the performance fault was "the stupid
database". In this case, DB2 on AS/400... but the errors are obvious!
Read carefully. He complains that one "issue facing the Java team was that the [original
programming] tool managed its constraints internal to the tool, and this was not available to any
outside application including DB/2."
Duh. Constraints should always be in a central location. And that ain't some app server or
programming tool. The one true location for constraints visible to everyone and enforced on
everyone is the database.
http://thecodist.com/article/wtf-it-stories-1-its-not-the-database-stupid
|