Home>Question Details



John -- Thanks for the question regarding "Surrogate versus Natural Keys", version 10.1.0

Submitted on 27-Feb-2008 8:53 Central time zone
Last updated 1-Feb-2010 12:12

You Asked

Hi Tom,

The project I'm currently working on has the DB design rule that all tables must have a surrogate key. This is the case even if a perfectly good natural key exists.

The primary motivation appears to be that of join efficiency by removing the possibility of having to join two tables on more than one column.

Personally I'm not a fan of surrogate keys in general, or of this sort of blanket policy in particular. I believe that there is often much to be gained by inheriting the PK of the parent table into the child table as part of the PK of that table.

Also, when two tables are joined by a surrogate key there is an immediate loss of semantic information which would exist if natural keys were used instead. For example, what's the "real" relationship between the two tables and which of the "real" columns are involved in this relationship?

One thing that annoys me with designs like this is that the natural key column(s) from the parent table are often denormalised into the child table anyway for reasons of performance. So the only gain is usually the performance difference between joining on a single column and joining on multiple columns.

I've always thought this would be a marginal difference in most cases. Do you have any comment about this in general? Do you favour surrogate keys over natural keys? How tight would performance considerations have to be to justify such a scheme?

and we said...

...
The project I'm currently working on has the DB design rule that all tables must have a surrogate key. This is the case even if a perfectly good natural key exists.

......

perfect example of why I despise the guise of "best practice", or ROT - rules of thumb. One persons 'best practice' is another persons nightmare.

....
The primary motivation appears to be that of join efficiency by removing the possibility of having to join two tables on more than one column.
......

hmmm, wonder if anyone ever tested that.... It would not make that much difference - if any - in real life. Especially in a transactional system. It might be true in some cases that the index on a single number would be "shorter" (less levels) than an index on a multi-part key - but that would be one extra logical IO - in most cases, I would say it would not be worth it (for the reasons you list, loss of information, loss of semantic meaning, loss of detail)

If you have a natural key, by all means - use it. The natural key should be immutable and sensible - they are sometimes "rare", but if I had a document table, document version table - I would definitely use the document_id (might be a surrogate) as the primary key of one and (document_id, version# ) as the primary key of the other (and document version might have an association to authors so it's primary key is a foreign key elsewhere in the system too)

http://asktom.oracle.com/pls/ask/search?p_string=%22natural+key%22

Reviews    
3 stars Surrogate versus Natural Keys   February 27, 2008 - 1pm Central time zone
Reviewer: Monty from St. Louis, MO USA
So, if we create a table with a surrogate key; create a unique index on the natural key; (we want 
to enforce data integrity) use only the columns in the unique index to select data for DML; ... 
what was the reason for the surrogate key to begin with? 


Followup   February 27, 2008 - 7pm Central time zone:

indeed, what was the reason? 
3 stars RE: "what was the reason for the surrogate key to begin with?"   February 27, 2008 - 4pm Central time zone
Reviewer: Duke Ganote from Amelia, Ohio USA
When you've got a data model with child tables that have child tables that are also grandchildren 
of another table, then your "business key" (there's little "natural" about most keys... ok, off my 
soap box) gets to be cumbersome.  In other words, the "business key" encompasses many columns.  

Also, suppose your so-called "immutable" business keys need to change.  Why? Perhaps because your 
company buys another one and you need to 'fit in' other data.  With a surrogate key, you can change 
your unique constraint "business key" to a function-based constraint based on some new columns or 
conditional data -- without revising the entire data structure of child tables.


Followup   February 27, 2008 - 7pm Central time zone:

tell me what to do then with:

document:

document_version:

document_version_to_author:

author:


I understand we might use a surrogate for document. And maybe for author.


but what about document version and document version to author please?




5 stars Why surrogate keys?   February 27, 2008 - 7pm Central time zone
Reviewer: Mike from Cleveland, OH USA
I think the question of 'natural versus artificial keys' is different from the question of 'should 
child tables use the FK to their parent(s) in their keys?'

In either event, I think the bigger issue (that is, bigger than cumbersome compund keys) is: what 
happens when a key needs to change?

With a true synthetic key, you should never have to change it - as the business should not know it 
and therefore can never have a need to say it has the 'wrong' value. 

With a natural key, the need may arise to update it. This gets much worse if there are child 
tables, as those rows also have to be located and updated. Throw in some grandchildren and 
great-grandchildren, stir in a handful of non-deferred foreign-key constraints and you have a real 
undertaking there.

Having said that, I prefer to use natural keys when they are available. But I have found more and 
more that the 'good, well-behaved' natural keys are rarely that - especially because there is 
usually some obscure case where they need to be updated. So I have found myself using artificial 
keys more and more often.


Followup   February 27, 2008 - 7pm Central time zone:

.. I think the question of 'natural versus artificial keys' is different from the
question of 'should child tables use the FK to their parent(s) in their keys?'
...

interesting, really? If you were in a shop that said:

"The project I'm currently working on has the DB design rule that all tables must have a surrogate key."

would you still say that? They didn't say "every table that isn't a child", they said "EVERY TABLE (eg: including the evil child table) must have a surrogate key.


So why change the question here?



... With a natural key, the need may arise to update it. ...

fail: take the document table with the 'sugar coated key of document id' - whatever that is, and the document version table.... do you need to update it?


that "we'll buy your company and we'll change your keys" does not fly with me either, that is a one time "we need to MOVE data from db1 to db3 (there is no db2 you see.....)", you obviously can rekey... You have to - there will be duplicates..

but tell me - how often does company A buy a company B that uses the SAME EXACT OFFERING OF SOFTWARE to even make this a consideration????? (I will guess: about ZERO, what do you guess??)

Let us stick with reality - an application, with keys that are provably immutable (the vast majority by the way...)

now what?
4 stars   February 27, 2008 - 8pm Central time zone
Reviewer: David Aldridge from Fairfax, VA
Hmmm, what an interesting topic ...

Overuse of synthetic keys can of course require that queries include more joins in some 
circumstances, hence require more work to execute, in order to make a usable result set. Thus a 
subset of queries can end up paying a small tax for that design approach.

Underuse can require that some ddl runs enormously longer, as child records have to be modified, or 
you have to take a hit on development plus a maintenance outage to change a key from natural to 
synthetic.

You just have to "pick your poison".


Followup   February 28, 2008 - 3pm Central time zone:

as in "it depends" :)
4 stars Immutable Natural Keys   February 28, 2008 - 9am Central time zone
Reviewer: SeánMacGC from Ireland
One of the most inscrutable uses of surrogate keys, predominantly in data warehousing environments, is for DATE surrogate keys, where some folks deem it compulsory, apparently.

Not sure what the designers are fearing here, given that once a particular date always a particular date, in all systems -- a reversion to the Julian Calendar perhaps? (That's not to say, of course, that an event didn't occur on different date to that originally assigned)

Followup   February 28, 2008 - 11pm Central time zone:

... (That's not to say, of course, that an event didn't occur on different date to that originally assigned) ..

right, but that would require an update regardless of the use of the surrogate - in fact, one might say a "harder update", you cannot just fix a date from 15-mar-2007 to 25-jun-2007 with an update if you used surrogates, you have to look up what magic number you assigned to 25-jun-2007 first.
5 stars My 2 cents, or 20 cents...   February 28, 2008 - 12pm Central time zone
Reviewer: Bill from Cedar Hills, UT
Just reviewed your other surrogate/natural topics on AskTom. Looks like we are in agreement: would be nice to mostly use surrogate keys, but sometimes it depends on the nature of the data, the business, the anticipated use, etc.

In the ideal world found only in my head, we would have surrogate keys on every entity. Makes for easier join query writing, automation and a few other niceties.

But reality frequently gets in the way.

In my first job as an overcharged but underpaid Andersen Consulting grunt, the tables we used for a large telco had no artificial keys. So every table had a 3 to 8 column natural key. Other than that, the model was fairly well normalized. So in our code, where we needed to join 5 to 8 tables together, we frequently ended up with 30 to 40 lines just dedicated to the predicate to get them hooked up. Since Andersen's unit testing standard demanded a separate test condition for every permutation of a where clause, including NULL and mix/max values for each, I quickly became an undying fan of single column surrogate keys.

But I've seen it taken too far as well, creating surrogate keys for everything, including associative/cross-ref entities -- where they really didn't help anything, were never used and got in the way.

Couple more stories...

My last company's previous modeler insisted on using string-based short codes as the PK for most reference/lookup tables, mainly, they told me, to make querying easier for developers. They kept patting me on the head and telling me not to worry about meaningless PKs, that the codes would never change. 3 years later they wanted to change one of the codes, and found the update to the child tables would take 6 to 20 hours at every customer's site. Surrogate keys finally started to make sense to my technical architect who couldn't be convinced until that day came.

They sold a software system that was, for the most part, similar for all the participants in the energy industry. But there were enough variations that we had to maintain separate copies of the UI and data schema for the nine major markets. Underneath that, we had a common data model for the things that the nine markets shared. Some of these entities were partitioned and handled huge data volume.

Unfortunately the modeler that preceded me decided to use a multi-column, local natural key for these tables. Every once in a while, we would sell the product to a new customer that had new uniqueness requirements that didn't match the common model. So we would have to write a script that would re-create the enormous UK and roll it out to ALL the customers in ALL the markets to accommodate them. It meant hours of downtime for customers that shouldn't have been affected by a new customer across the country. I was able to convince them to at least start using market-specific UKs on the tables shared in common. Now I hear they may be factoring out the natural key into a master table, giving it a surrogate, and then using the surrogate as the PK for the partitioned beast. Should ease their woes.

So natural/business keys that change, can kill your business as well.

In the end, the judgment call for surrogate vs. natural depends and the circumstances. And that's where an experienced data architect/modeler is invaluable.

Followup   February 28, 2008 - 11pm Central time zone:

the only answer that always makes sense to any question is....


it depends :)
5 stars Love These Religious Wars.   February 28, 2008 - 10pm Central time zone
Reviewer: Phil Singer from Beautiful Downtown Brighton
As has been said, an artificial key on every table is going to extremes.  You end up with something 
like this:

From the Ed Sullivan Data Centre in Souix City Iowa:

Top Ten List

Today's Topic:  Top 10 Reasons to have an Artificial Key on Every Table:

10.  You don't ever have to look for natural keys again.

9.   Name your key column 'PK', and all your join predicates can become just  'NATURAL JOIN'

8.  You can have every other column in a row duplicated many times, and still claim to have a 
relational table.

7.  A SELECT *  in SQL*Plus looks just like what you get when you view it in MS Access.

6.  CASE tools will give you this column automatically; 20% of your work is done with one click.

5.  It makes it real easy to delete the duplicate rows you got because you didn't bother to put a 
unique index on the natural key.

4.  You can astound your friends with how complicated your m<->m association tables are.

3.  Once people learn of one of your artificial keys, it becomes a natural key, and you get to 
create another one.

2.  The index on the artificial key on tables with a lot of updates/deletes is likely to become 
large, with a lot of sparse blocks in the tail end.  This will give you a lot of indexes which 
could benefit from periodic rebuilding, thus justifying rebuilding all indexes on a weekly basis 
(which is probably another corporate standard).

And, the number on reason to have an Artificial Key on Every Table:

1.  This is how SQL Server does it.


Followup   March 1, 2008 - 10am Central time zone:

very nice :)

I liked #3 and #1 in particular.

#9 would be a bug - as the PK in one table would be FK in the other and the natural join would do the wrongest of things...
4 stars Immutable Natural Keys   February 29, 2008 - 6am Central time zone
Reviewer: SeánMacGC from Ireland
Followup  February 28, 2008 - 11pm US/Eastern:
... (That's not to say, of course, that an event didn't occur on different date to that originally assigned) ..

right, but that would require an update regardless of the use of the surrogate - in fact, one might say a "harder update", you cannot just fix a date from 15-mar-2007 to 25-jun-2007 with an update if you used surrogates, you have to look up what magic number you assigned to 25-jun-2007 first.


Exactly -- a totally pointless level of indirection and obfuscation.

5 stars refresh my memory...   March 5, 2008 - 8am Central time zone
Reviewer: Connor from Perth
I was reading something in the last couple of weeks, where a built-in dbms package (that did some 
sort of table maintenance) could only except 1 or 2 column primary keys as input.  Anything more 
than 2 was not permitted...

Can't remember what it was ... Anyone ?

but that restriction may be an issue when deciding on natural vs surrogate



4 stars Surrogate key is just workaround   July 4, 2008 - 4am Central time zone
Reviewer: Zilvinas from Lithuania
In theory there is no any surrogates.
Business model does not have any surrogates.
RDBMS are made following the rules of theory.

What do you want to cheat with your surrogates? Business?  RDBMS creators? Yourself?

One of main requirements of database is data integrity.
When designers use surrogate keys they often loose data integrity. If they will try not to then 
they will end up with the same natural design, only that every table has one excess column and one 
excess index.

If you design your database following theory, you will find rare cases where you need to use 
workaround (surrogate key).

As for "never use business data for primary keys" I agree. Make ID's for you business "atom 
objects". But from this point use natural design like in example with document versions.

As for "what if happens something what has a chance of happening of one billion'th" it's just 
stupid. There's much bigger chance that you system will be obsolete way before it happens.

And one question. Many Web developers say they absolutely need id for every table. Are they not 
experienced enough or they use some crappy frameworks or something else?


5 stars   December 9, 2008 - 9am Central time zone
Reviewer: Oracle learner from Scotland
Hi Tom,

If I understand your argument correctly then if I have a table like the following:


ID Country_Code Country_Description
1  USA          United States of America
2  UK           United Kingdom
3  F            France
4  D            Germany


then I could dispense with the surrogate ID column and have a primary key on the 'Country_Code' column. However, since the Country_Description column is also unique should I add a unique constraint here to maintain data integrity, or am I missing the point?

Thanks


Followup   December 9, 2008 - 2pm Central time zone:

do you have a business rule that says the country_description must be unique (probably you do). If so, you would want to enforce that rule - yes.



5 stars   April 21, 2009 - 11am Central time zone
Reviewer: Kevin from St. Louis, MO, USA
Hi Tom,

The main rationale I see for introducing surrogate keys is the overhead of large foreign keys. A common form of example is:
create table files (
file_name varchar2(100) not null,
file_date date not null,
constraint files_pk primary key (file_name))
/
create table transactions (
file_name varchar2(100) not null,
account varchar2(20) not null,
amount  number(15,2) not null,
constraint transactions_pk primary key (file_name,  account),
constraint transactions_fk1 foreign key (file_name) references files(file_name) on delete cascade)
/

Which could also be modeled w/ surrogates as:

create table files_with_surrogate (
file_id  number not null,
file_name varchar2(100) not null unique,
file_date date not null,
constraint files_ws_pk primary key (file_id))
/
create table transactions_with_surrogate (
file_id number not null,
account varchar2(20) not null,
amount  number(15,2) not null,
constraint transactions_ws_pk primary key (file_id,  account),
constraint transactions_ws_fk1 foreign key (file_id) references files_with_surrogate(file_id) on 
delete cascade)
/


Assume file_names average ~100 bytes in size, and there are an average of 1 thousand transactions per file_name:
 
INSERT INTO FILES 
SELECT RPAD('f',90,'f')||LEVEL, SYSDATE 
FROM DUAL CONNECT BY LEVEL < 201
/
INSERT INTO TRANSACTIONS
WITH T1 AS (SELECT LEVEL N FROM DUAL CONNECT BY LEVEL < 1001)
SELECT FILE_NAME,   
       RPAD('A',10,'A') || TRIM(TO_CHAR(N,'0000000000')) C2, 
       N / 100
FROM FILES  , T1
/
INSERT INTO FILES_WITH_SURROGATE 
SELECT LEVEL, RPAD('f',90,'f')||LEVEL, SYSDATE 
FROM DUAL CONNECT BY LEVEL < 201
/
INSERT INTO TRANSACTIONS_WITH_SURROGATE (FILE_ID, ACCOUNT, AMOUNT)
WITH T1 AS (SELECT LEVEL N FROM DUAL CONNECT BY LEVEL < 1001)
SELECT FILE_ID,   
       RPAD('A',10,'A') || TRIM(TO_CHAR(N,'0000000000')) C2, 
       N / 100
FROM FILES_WITH_SURROGATE, T1
/
COMMIT
/
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS ('APPS','FILES');
  DBMS_STATS.GATHER_TABLE_STATS ('APPS','TRANSACTIONS',NULL,10);
  DBMS_STATS.GATHER_TABLE_STATS ('APPS','FILES_WITH_SURROGATE');
  DBMS_STATS.GATHER_TABLE_STATS ('APPS','TRANSACTIONS_WITH_SURROGATE',NULL,10);
END;
/

This is a situation where the surrogate key approach shines; the following are much more expensive for the natrual-key apporach compared to the surrogate-key "file_id" solution:

1) Space needed for transactions (27 Mb) vs transactions_with_surrogate (8 Mb)

2) Space needed for transactions_pk (53 Mb) vs transactions_ws_pk (13 Mb)

3) Overall cost of joins, e.g.:
SELECT  F.FILE_DATE, SUM(AMOUNT) 
FROM    FILES F, 
        TRANSACTIONS T
WHERE   F.FILE_NAME = T.FILE_NAME
GROUP BY F.FILE_DATE


recursive calls  1
db block gets  0
consistent gets  3459
physical reads  0
redo size  0
bytes sent via SQL*Net to client  279
bytes received via SQL*Net from client  230
SQL*Net roundtrips to/from client  2
sorts (memory)  1
sorts (disk)  0

vs.

SELECT  F.FILE_DATE, SUM(AMOUNT) 
FROM    FILES_WITH_SURROGATE F, 
        TRANSACTIONS_WITH_SURROGATE T
WHERE   F.FILE_ID = T.FILE_ID
GROUP BY F.FILE_DATE


db block gets  0
consistent gets  949
physical reads  0
redo size  0
bytes sent via SQL*Net to client  279
bytes received via SQL*Net from client  256
SQL*Net roundtrips to/from client  2
sorts (memory)  1
sorts (disk)  0


Either of the values "bytes per filename" or "transactions per file" can be increased to further the degradation on points 1-3 of using the natural key.

Having said that, I dislike the idea of complicating a data model for performance, and consider it a "last resort". Is there an alternative? Given the declaration of the foreign key relationship, it seems that behind the scenes, Oracle could implement the natural data model by
1) maintaining ROWID pointers from transactions to files; and
2) use behind-the-scenes joins to resolve the transactions.file_name value when it is called for.

Put another way, when I run the natural key DDL for TRANSACTSIONS above, Oracle could build this behind-the-scenes objects:
CREATE (hidden) TABLE RAW_TRANSACTIONS
(FILE_ROWID ROWID, ACCOUNT, AMOUNT,
CONSTRAINT P1 PRIMARY KEY (FILE_ROWID, ACCOUNT),
CONSTRAINT C1 FOREIGN KEY (FILE_ROWID) 
    REFERENCES FILES.ROWID)
/
CREATE (view-treated-as-visible-table) TABLE TRANSACTIONS
AS SELECT FILE_NAME, ACCOUNT, AMOUNT
FROM RAW_TRANSACTIONS, FILES
WHERE RAW_TRANSACTIONS.FILE_ROWID = FILES.ROWID
/


Then, the CBO would need the ability to rewrite this query:
SELECT  F.FILE_DATE, SUM(AMOUNT) 
FROM    FILES F, 
        TRANSACTIONS T
WHERE   F.FILE_NAME = T.FILE_NAME
GROUP BY F.FILE_DATE


as this:

SELECT  F.FILE_DATE, SUM(AMOUNT) 
FROM    FILES F, 
        RAW_TRANSACTIONS T
WHERE   F.ROWID = T.FILE_ROWID
GROUP BY F.FILE_DATE


(reminiscient of the fancy-footwork done with COMPRESS tables)

I'm aware that using ROWID / memory pointer data at the logical level is intrinsically unstable, and I'd not dream of declaring a transactions table with a FILE_ROWID column. However, as long as Oracle was doing the bookkeeping on the physical pointers, could this work? More pointedly, is there any chance Oracle does or might offer such a feature?

-Kevin

Followup   April 21, 2009 - 3pm Central time zone:

In most cases, you would not be getting the entire set (think "emp and dept") so the consistent gets do not put me off so much (nor does the storage really).


How about this, what if in transactions you always needed to see the file_name.

when you take the natural key out, you HAVE to walk the surrogate back to the other table in order to get it (and many times you want it, so many times you end up joining by the surrogate in order to GET the natural key)

so, for every pro, there is a con, for every con there is a pro.

Funny thing life is, while I was reading this - I received an email (no joking, this really happened) and the subject was:

"use of surrogate keys"

Another (serious) downside to surrogate keys (in addition to being FORCED to walk back to the source table to pick up the natural key!) noted in that email to me (part of an ongoing discussion) was:


o Contention and I/O management

Massive insert rates usually require the primary key index to be maintained as part of the workings of the database. Using a sequence or encoded timestamp results in a right growing index and often becomes the single point of contention on a system.


definitely, that happens frequently. People start introducing things like reverse key indexes, hash partitioning - in efforts to reduce that.



And as stated in other articles (on asktom) and here, there is a tradeoff to be made. If you have a natural key that is truly 100 bytes on average (I would personally call that the edge case, the extreme case) then we might have something to discuss.

Surrogate keys as the rule - absolutely not.
Surrogate keys as something to consider in balance with everything else - quite so.





5 stars   April 21, 2009 - 5pm Central time zone
Reviewer: A reader 
Hi Tom - appreciate the thoughts and insights.  

A more direct question - in the situation where, after weighing pros/cons, it is deemed "best" to 
use a surrogate, would you think it useful in a future Oracle release to have a DDL keyword that 
tells Oracle, "link these tables over this foreign key constraint using a hidden surrogate key".  
The syntax might look something like:

create table transactions (
file_name varchar2(100) not null,
account varchar2(20) not null,
amount  number(15,2) not null,
constraint transactions_pk primary key (file_name,  account),
constraint transactions_fk1 foreign key (file_name) references files(file_name) <b>USE SURROGATE 
KEY</b>
)


When used, there is an understanding that all the performance pro's and con's of surrogate logic 
apply, *but* we let the database worry about the implementation details (including query rewrites 
for natural-key joins), and proceed with all application designs as though we're just using a 
natural key.

Or, is there a fundamental reason (some short-coming of the automated approach) that surrogate keys 
should always be explicitly implemented by the database designers, managed by the applications, and 
exposed in the data dictionary?



Followup   April 21, 2009 - 5pm Central time zone:

we do this with nested tables today - using a 16 byte raw populated with sys_guid()

I am not a fan.

I am not a fan of 'magic' and 'side effects', I have no problem being painstakingly forward in my linear code showing exactly what is happening (I am on record many times saying "do not use nested tables to persist data in the database, use them in plsql code, but never in DDL)


I don't have a problem with building a data model - explicitly, with great detail, "on purpose".
5 stars Java   April 22, 2009 - 3pm Central time zone
Reviewer: Pat from Oakland, CA, USA
I am working on a project where everything is done with Java frameworks and Hibernate has been 
selected to interface with the database.

I was told that ALL tables must have a surrogate key that is a VARCHAR2(32) because that is what 
Hibernate uses internally and it cannot use a multi-column PK.

So, ALL our tables now have a single column surrogate key and we have unique indices on the natural 
keys, to make sure Hibernate doesn't corrupt the data.

In my case, there was not even a case to discuss. Hibernate was selected and its way of working was 
imposed onto the DB people.

I simply hope that no other application will need to access this DB directly, since everything is 
now hidden inside Hibernate.

I was told that PL/SQL was a big NO because they wanted the code to be "database agnostic". 
Experience taught me there is no point arguing with the Java people because they are not willing to 
listen to anything that contradicts their gospel.

To finish up, here is a wonderful example of one of our important tables, the calendar that maps 
the fiscal calendar used by the clients. Instead of using the date as the PK, we have an anonymous 
serial number given by a sequence stored into a VARCHAR2(32). Simply brilliant! Thanks Hibernate.


Followup   April 24, 2009 - 4pm Central time zone:

... I was told that ALL tables must have a surrogate key that is a VARCHAR2(32)
because that is what Hibernate uses internally and it cannot use a multi-column
PK.
...

that is a lie and your developers should be banned from the database for that reason alone.

... to make sure Hibernate doesn't corrupt the data.
...

no no, hibernate doesn't corrupt data, that is what your developers do.


Don't blame hibernate actually, it is your developers - and why are they in charge? I'll never get that.

Don't get me wrong, I'm not a fan of hibernate, but only because of the company it attracts.
5 stars   April 22, 2009 - 6pm Central time zone
Reviewer: Lev from Calgary, Canada
Pat,

I've been there and I know that "database agnostic" = disaster.
There is no way you can avoid it. If your application
executes anything more complicated then select * from emp
it can't be database agnostic.

Hibernate is not that bad when used properly.
The problem is when you are using Hibernate you need to understand both 
Hibernate AND database.

The biggest issue that above mentioned disaster could happen late - when almost all
development is done and you can't change anything. 
The earlier it happens - the better.

That's what I got from my (very painful) experience.

1) Insist that you need database constraints (PKs, FKs etc). 

2) If 1) is not possible - create scripts verifying 
   data quality and include them in nightly build. Let managers see the results.

3) Check sql ¿ queries that are generated by Hibernate. Sometimes it generates 3 queries instead 
of one. Insist that developers should fix that.

4) And the most important - don't allow developers to work on empty database. 
Populate database with at least 50% of real amount of data. 
Java people like to wipe database, recreate database structure, insert 2-3 records in 
several tables and than execute unittests. If there were no errors they think that they 
are on the road to success.
They call it TDD. 
Find a way how to execute their unittetsts on real amount of data at least once a day. 
Usually management can understand importance of testing application in real environment.

And the last - you really need some good luck. 
Believe me very soon you will hear that project needs another database (to be database agnostic) or 
even worse - new DBA.

That's why good luck!
Lev


Followup   April 24, 2009 - 4pm Central time zone:

good points all.


and the only reason #1 would not be possible is if the developers actually allow the data to be inconsistent in between commits (eg: deferrable constraints should ALWAYS work, unless the developers allow the data to be inconsistent between commits - which will obviously lead to data inconsistencies and the need for that "data checker" you wrote of)
3 stars MYTH: "You only need to unit test"   April 25, 2009 - 4pm Central time zone
Reviewer: Duke Ganote from Newtown, Ohio USA
If your developers say "you only need to unit test", they have no idea what TDD is.

http://pages.citebite.com/t1h4d1i7o6nwx

http://www.agiledata.org/essays/tdd.html


5 stars Follow up   May 6, 2009 - 9pm Central time zone
Reviewer: Pat from Oakland, CA
A quick follow up on what I wrote earlier:

1. We now are giving the system to users for serious testing and, of course, they find bugs. Well, the problem now is that everything is in Hibernate and I have absolutely no clue what query gets generated. I know I could ask for the ASCII strings, but I am a bit afraid of what I would be given...

2. We also have heavy duty processing written in Java that is using Hibernate to access the DB. We have a pretty decent volume of data to crunch regularly and of course, the Java code is multi-threaded (as it should be.) However, now and then, we have dead lock errors that pop up, and I have no clue where those are coming from. My hunch is that the Java code is relying on deferred commits from Hibernate and somehow, the sets the threads process overlap. Since Hibernate returned "ok" when asked to commit WITHOUT actually committing on the DB side (to get a better throughput probably), it still holds locks for a little while. Then, the thread receiving the "ok" from Hibernate moves on and, from time to time, is going to pick up more data to process that is overlapping with the rows being locked by Hibernate. If the thread is faster doing its job than Hibernate is doing its job before sending the UPDATE statements to the DB and committing, Oracle will detect a dead lock. This is just a hunch I have and I don't know for sure, because it is VERY difficult to reproduce, which, to me, indicates a race condition. I suspect they could turn Hibernate into a fully synchronous mode (Hibernate would not return OK until it has actually committed in the DB) and that would remove the dead lock problem, but I bet it would kill the performance. So, they probably decided to use Hibernate in asynchronous mode (it returns ok even before it committed the changes to the DB) and they are betting that the race condition won't happen too often.

3. Fortunately, we have FK's everywhere and we also have unique constraints on the natural keys, so we can make sure the data won't get corrupted.

4. This constraint of forcing an anonymous, single column PK also has consequences on the schema design. When you have a table, a child table and a grand child table, if you only join them with their respective anonymous PK, you can get yourself into a situation where you lose information on the dimensional side of things. This means that it becomes next to impossible to write MERGE queries on the grand children tables based on the actual dimensions that should be inherited. You now have to write a row by row approach to enforce "by hand" dimensional constraints that have been lost along the way.

I think a lot of Java developers have turned into religious fanatics who simply refuse to open their minds and want to be protected from the evil world of databases. :)

Had it been my call, I would have built a nice PL/SQL business API (not table based) that takes care of locking the rows I need to modify... But then, it would not be "database agnostic" which is probably one of the biggest lies ever, along with "write once run everywhere". But what do I know, I am just a DB guy. :)

Followup   May 11, 2009 - 11am Central time zone:

... We have a pretty decent volume of data to crunch regularly and of course, the Java code is multi-threaded (as it should be.) ...

I would disagree with that, the database is a pretty good 'parallel processing thing'. The data crunching should be single threaded with PARALLEL QUERY.

opinion: most developers are not capable of writing thread safe applications, reasoning for my thinking that: most are not capable of writing a trigger that is multi-user safe, they do not think about race conditions - what happens when 50 thing happen at once.


and reading the rest of #2 confirms this opinion.



5 stars DB agnostic code   November 5, 2009 - 6pm Central time zone
Reviewer: Pat from Oakland, CA
One of the key features "sold" by the Hibernate (and other ORM frameworks) users is that the code 
they write with it will magically become database agnostic.

In other words, the theory is that you could write an application with Hibernate with Oracle as a 
DB engine and then, decide to use MySQL or SQL Server instead. In theory, you would just need to 
recreate the tables, migrate the data and you don't need to change anything in the Java code.

I have to admit I have a hard time getting my head around this concept. Since Oracle implements 
"read consistency" everywhere all the time (readers don't block writers and vice versa) the Java 
code will have been built tested and debugged with that paradigm baked into the handling of 
transactions. If you move to another DB engine like MySQL for instance (which, as far as I know, 
does NOT implement read consistency), I don't understand how the exact same code will be able to 
generate exactly the same results all the time, especially when stressed by many concurrent users.

Does anybody have real world experience with swapping out the DB engine running under a Java code 
base using Hibernate? What were the results?


Followup   November 11, 2009 - 1pm Central time zone:

I've had a hibernate developer tell me:

that is ok, you just check the 'transactional' box and it'll fix all of your code.


I showed him a piece of logic that works provably in sqlserver - a bit of logic to pull changes from one system to another.

I showed him how that same logic would not work correctly in Oracle - using any isolation mode, the LOGIC had to change.

He said "hibernate would do that, just check transactional"

That is when I knew it was pointless to discuss, that developer had no idea really how the database worked or what transactional meant really - they were just told to check that box and all would be ok.



3 stars Hibernate-generated schema   January 29, 2010 - 11am Central time zone
Reviewer: C. W. from Seattle, WA
As a java developer (and a hibernate developer too), anyone who would suggest relying blindly on 
Hibernate to generate a good schema based only off of the object domain model is drinking the 
kool-aid. That's not the intent of the Hibernate people. From the book "Java Persistence with 
Hibernate" which is the revised edition of "Hibernate in Action" written by the Hibernate code 
authors, page 377: "you should always ask your DBA to finalize your database schema." And then, "We 
recommend that DBAs get involved early and verify the automatically generated DDL from Hibernate."  
And finally, "a professional DBA verifies and optimizes the SQL DDL and creates the final database 
schema."


Followup   January 29, 2010 - 4pm Central time zone:

wonder how many hibernate developers

a) read the book
b) follow it.


In my experience, the number is in the single digits. especially for (b)
3 stars   January 30, 2010 - 4pm Central time zone
Reviewer: Oleksandr Alesinskyy 
wonder how many hibernate developers

a) read the book
b) follow it.

Quite a lot - it is a kind of "Hibernate bible".

As for "Hibernate developer saying 'check a box'" from your previous comment - he was not a Hibernate developer at all, as Hibernate as such does no provide any user interface and so has not 'boxes to check'.

Followup   February 1, 2010 - 10am Central time zone:

Maybe you think they do (a).
I know they do not do (b).


to "check a box" is a saying, an idiom. We "check boxes" to signify "they are done"

they enabled hibernate transactions:
https://www.hibernate.org/42.html


and said "it is all good, we are transactional now", with a very big smile on their face.


Again, I haven't met the kind of hibernate developer you talk of. I meet the ones that say "hibernate does that for us, we don't need to know what the database is or does, we use transactional"


I'm reminded of "it's what plants crave" from a favorite movie of mine when talking to them

http://www.imdb.com/title/tt0387808/quotes#qt0427932

3 stars @Oleksandr   January 30, 2010 - 5pm Central time zone
Reviewer: Enrique Aviles from FL

"Quite a lot - it is a kind of "Hibernate bible"."

I bet they follow the book a lot for Hibernate specific information, not necessarily for the referenced comments related to DBAs and database schemas.


3 stars @Enrique Aviles   January 31, 2010 - 3pm Central time zone
Reviewer: Oleksandr Alesinskyy 
I bet they follow the book a lot for Hibernate specific information, not necessarily for the referenced comments related to DBAs and database schemas

It is hard to miss as it repeated (with minor variations) all over the book, several dozen times. Invest a couple of hours in skimming over that book.

5 stars   February 1, 2010 - 11am Central time zone
Reviewer: Alexander 
Tom,

Do you still have that example you used, the one that showed something that would work in sqlserver 
but not in oracle?  Is it something you could share?


Followup   February 1, 2010 - 12pm Central time zone:

I have examples in every book I've written, but the one in this case is outlined here:

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:16998677475837


basically - it is a 'warehouse refresh', done in a database independent fashion.

It would "work" by having a trigger timestamp the time every record was last modified.

what would happen is, the refresh process would:

a) record what time it is RIGHT NOW into a table.
b) read out the last time it refreshed from that table (the prior (a) it had recorded)
c) pull every record that was modified since the time queried in (b) and apply it to the warehouse
d) commit

and do that over and over. Thinking that all changes would be pulled.

Well, let's say that the systems were in sync for a minute (no transactions and we made sure both systems are IDENTICAL). So far, so good.

The last time we refreshed was 8am today, recorded in our table.
At 9am, we record in this table "9am" and read out 8am.
We pull everything modified since 8am and apply to warehouse
we delete the 8am record
we commit.

We are in sync right? Well, no - because what I didn't tell you was that at 8:59:59am - I started a transaction. The record would be timestamped with 8:59:59amam - but since your select started at 8am (before I committed) also (at about or near the same time, doesn't have to be precise, it could in fact be far apart!) - your select would NOT see my modification. You would not block on the updated and uncommitted row - and if it were committed - you would not see that value - read consistency and all. So, your select would skip it.

And at 10am, you would read out 9am and miss it *again*, in fact you would miss it forever. Over time, the data would diverge in the two systems further and further. Spurious data inconsistencies would be reported by end users and then finally someone would look and find hundreds/thousands of inconsistencies. You (developer) would blame the database, sync the data and all would be well for a while - until it happens all over again.



In sqlserver - by default - your select at 9am would either hit the record before it was committed - get blocked and then read it, or it would hit it after I commit and since they by default do not do a consistent read - you would just read whatever you saw there.


Same logic, different outcome. 
5 stars   February 1, 2010 - 11am Central time zone
Reviewer: John Gilmore from London, England
...
I showed him a piece of logic that works provably in sqlserver - a bit of logic to pull changes from one system to another.
I showed him how that same logic would not work correctly in Oracle - using any isolation mode, the LOGIC had to change
...


I was just wondering if you could share the details of the logic with us, I'd be very interested to see it.

Followup   February 1, 2010 - 12pm Central time zone:

there are an infinite number of examples..... Just think of any multiuser situation with concurrent reads and writes of data.

Here is another one: reporting on the amount of money in the bank. What works very well in Oracle doesn't work at all in Sqlserver (by default)

see
http://www.oracle.com/technology/oramag/oracle/05-nov/o65asktom.html


5 stars natural keys   February 1, 2010 - 8pm Central time zone
Reviewer: A reader 



Write a Review
 


All information and materials provided here are provided "as-is"; Oracle disclaims all express and implied warranties, including, the implied warranties of merchantability or fitness for a particular use. Oracle shall not be liable for any damages, including, direct, indirect, incidental, special or consequential damages for loss of profits, revenue, data or data use, incurred by you or any third party in connection with the use of this information or these materials.

About Oracle | Legal Notices and Terms of Use | Privacy Statement