Skip to Main Content
  • Questions
  • Surrogate keys and their use in Data warehousing

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, Danish.

Asked: August 05, 2006 - 12:14 pm UTC

Last updated: September 23, 2013 - 5:41 pm UTC

Version: Oracle 9i

Viewed 10K+ times! This question is

You Asked

Hello.

I want to know what is the purpose of surrogate keys in data warehouses and how are they used in developing data warehouses. Especially their use in look up tables where mapping tables are made to resolve slowly changing dimensions problem of type 2. thanks

and Tom said...

Ah, the perennial "surrogate" key debate.

The purpose is simple:

to provide an application generated unique key value for each record, in place of the natural key of the record.

Sometimes records do not have natural keys.
Sometimes they do, but they are not "reliable" for whatever reason.


I've always been a fan of using NATURAL keys when common sense says "it is ok to use the natural key". (eg: the natural key is not 50 columns, or, the natural key is actually imutable and consists of a small number of columns, or, ....)

And using SURROGATES when no natural key exists. I use surrogates in this very application. There is no natural key for a question - hence I use a sequence (and other stuff to randomize it and just in general munge it up) to generate one.

You can go overboard with surrogates (eg: does a DATE_DIM table that has an entry for every day in a year really need a surrogate? or would the DATE itself suffice...).

You can go overboard with natural keys as well.

I'll publish this and just sit back and watch the opinions :)

Rating

  (14 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Debate?

A reader, August 05, 2006 - 9:52 pm UTC

Hi Tom,
Whilst I am in favour of natural keys, but at some points, I was in a situation where surrogate keys had some advantage. From the source system (outside our own systems), they had policy number, which they decided to change i.e., client number will be same as policy number from some point onwards. If we had policy number as part of primary key (client number + policy number), then we need to update PK (Horrible idea!). Since we generated surrogate key for each record, it was easy for us to change the policy number to reflect this change!, as for analytics purposes we want to use policy number whether client's policy has churned or not?
Regards


Tom Kyte
August 06, 2006 - 8:58 am UTC

That is an example of a "natural key that isn't a natural key because natural keys are IMMUTABLE - this key you are calling a key isn't immutable, hence it is not the natural key"

So here, a surrogate would be called for.

What the last reader said

David Weigel, August 05, 2006 - 10:53 pm UTC

Natural keys that seem immutable today have a habit of changing eventually. Vendors and other companies merge, employees change names and IDs, financial systems change (mucking up purchase orders and accounts), part numbers change, currencies change, social security numbers change (think identity theft), country names change, street names change. Not many things never change; maybe mathematical constants and the calendar. (But pity the poor 16th century programmers who had to rework their apps when Pope Gregory scratched ten days from October 1562.)

input errors

Alberto Dell'Era, August 06, 2006 - 6:57 am UTC

I agree with David Weigel above, and add input errors - since an operator can (and so will ...) make an input mistake, if I use, say, an "immutable" SSN as a PK, I am forced to write a routine to change it in every table that references it through a FK, cascading to (grand)^N-children also.

And maintain it - and it's so easy to add another grandchild forgetting to update that rarely-used routine.

In the context of an OLTP feeding a DWH, this is even more an issue, since you'd have to propagate the update there too, and the DWH is another database, maintained by a different team almost always ...

Then the DWH produces reports, and so you'll have to explain to management that that Gold customer named "Julyan" is still with us, just renamed "Julian" ... that is, propagate the update to the paper too.

So, much better to use surrogate keys in general In My Humble Opinion, even if course for some systems it may make sense to use natural keys.

what about star transformation?

Jon, August 07, 2006 - 12:11 pm UTC

1. Bitmap indexes
In my experience, in a warehouse you are typically getting data sourced from several locations so most natural keys are comprised of the source id and then their id.

table customer_dim
source_id (pk)
cust_id (pk)

Bitmap transformation requires single-column bitmap index on every join column on the fact so it would be better to do this:

table customer_dim
customter_id (pk)
source_id
cust_id

unique key on (source_id, cust_id)

2. Conditions
I thought you always wanted conditions to be placed on dimensions, not the fact tables for the best performance of a star schema. Using a surrogate key forces users to filter on the dimension. You could put the date on the fact table but then you have nothing to transform and then it wouldn't use the bitmap index on the fact.

Surrogate keys with SCD type 2 dimensions

Pratap, August 31, 2006 - 3:34 pm UTC

If I have SCD type 2 dimensions (Say parent as well as child in a snowflake) then surrogate keys can be usefull to maintain a parent child relationship between the tables. The surrogate key alone can be used to join the two tables together. Such a join will result in fetching data from both table valid for the same point in time.

But it can be a bit clumsy. Say a parent table has a row -
Parent_SK_1 Attribute1 01-Jan-2006 31-Jan-2006

Child has the following row -
Child_SK_1 Parent_SK_1 Child_NK_1 01-Jan-2006 31-Jan-2006

Suppose only the child record undergoes a change, then a new records has to be added in the parent as well as child table -

Parent -

Parent_SK_1 Parent_NK_1 Attribute1 01-Jan-2006 31-Jan-2006
Parent_SK_2 Parent_NK_1 Attribute1 01-Feb-2006 NULL

Child -
Child_SK_1 Parent_SK_1 Child_NK_1 01-Jan-2006 31-Jan-2006
Child_SK_2 Parent_SK_2 Child_NK_1 01-Feb-2006 NULL

Can anyone please validate my understanding of implementing surrogate keys for SCD Type 2 dimensions in a snowflake?


Tom Kyte
August 31, 2006 - 7:04 pm UTC

not sure what you mean here? why are the dates there and correlated with eachother?

Sorry this should make it clear -

Pratap, September 01, 2006 - 11:00 am UTC

Suppose we have a table called product_master. It has got a child called product_detail. The relationship between the tables is maintained using product_key_sk. Both are SCD Type 2.

I insert a row in product_master -

product_key_sk natural_key product_name valid_from valid_to
_____________ ____________ __________ ___________ ________
parent_sk_1 parent_nk_1 p1 01-Jan-2006 null


I insert a corresponding row in product_detail -

product_key_sk valid_from valid_to Other columns
_____________ ___________ ________ _____________
parent_sk_1 01-Jan-2006 null

Now suppose the product_name in product_master gets updated from p1 to p2

product_key_sk natural_key product_name valid_from valid_to
_____________ ____________ __________ ___________ ________
parent_sk_1 parent_nk_1 p1 01-Jan-2006 31-Jan-2006
parent_sk_2 parent_nk_1 p2 01-Feb-2006 null

The detail record remains unchanged. Now if I query product_master and product_detail like this -

select *
from product_master m,
product_detail d,
where m.product_key_sk = d.product_key_sk
and m.valid_from_date >= 01-Feb

This query will not fetch any record as there is no row for parent_sk_2 in the detail table. And my join between the two tables on product_key_sk is valid because that is how the relationship is defined. So to resolve this problem I need to insert another row in product_detail even though the product_detail table is not changed -

product_key_sk valid_from valid_to Other columns
_____________ ___________ ________ _____________
parent_sk_1 01-Jan-2006 31-Jan-2006
parent_sk_2 01-Feb-2006 null

Is this approach correct, if not how surrogate keys are maintained in a datawarehouse with a snowflake schema and SCD Type 2 dimensions.


Tom Kyte
September 01, 2006 - 11:09 am UTC

I don't understand yet why the parent is getting a new surrogate.

using your current approach you would have to of course copy all functionally dependent data - that is your choice.

Why new surrogate for parent

Pratap, September 01, 2006 - 11:24 am UTC

The master is getting a new surrogate because it is the primary key of the table. The tabe is a SCD Type 2 and we need to keep history as well as current data. To my understanding this is the approach that is used in a datawarehouse.

Tom Kyte
September 01, 2006 - 11:26 am UTC

I don't see why you need an entirely new number. That is your design choice.


the combination of the "sk" and the effective date is sufficient to uniquely identify a row in that table. And the "sk" is all the detail needs.

it is up to you. If you re-"sk" the data, you'll obivously have to carry that all of the way down.

Surrogate key for aggregates

Austin, June 21, 2007 - 8:14 am UTC

Hi Tom,

I work in a reporting module where we collect the data from various systems.The requirement then is to create reports for variuos level aggregates i.e. day,week,month,quarter,year.Day level table contains day level aggregated count,month level table contains month level aggregated count and so on

Question 1.

Do we need to create seperate tables for varoius level aggregates or is it better to have day level aggregate and just aggregate for various other level aggregates.Is there any other way which is better than this

Question 2.
There is only one dimension in all these aggregate table.should I create a surrogate key for this dimension.

The question below are generic questions.

Question 3.
Do you agree that surrogate key helps in saving space in the fact tables.

Question 4.
I read somewhere that when there are a few dimensions in the fact tables it is better to create a bitmap index on these dimensions.It improves the performance of the query.

Regards
Tom Kyte
June 21, 2007 - 10:55 am UTC

q1: of course the only sensible answer for this is "it depends"

given that a month is just 31 records maximum, it would probably be safe to say that "you can get months from days pretty easy" - unless of course you report on lots of months in which case you are now processing 31 times the number of records.

and so on. It is highly likely that if you take hundreds of thousands of daily records and aggregate them into a single daily observation - all of your other aggregates could be computed on the fly from this one aggregate. probably.

do the math.


q2: eh?

q3: no. it depends on the natural key values of course. Are they larger or smaller than your surrogate values.

q4: geez. where is the science. If this hunch is true - you would be able to come up with at least one use case that applies to you and makes it obviously true or not.

what is "few"
why only a "few" - is this true when there are a lot? if it is - why say few? if it isn't - why not?
WHAT QUERY?????? all queries? some queries? a specific query?

More about surrogates

Marco, June 21, 2007 - 4:23 pm UTC

I find the opinion at http://www.datraverse.com/technology/sql.php very useful too

David Aldridge, June 22, 2007 - 6:59 pm UTC

>> q3: no. it depends on the natural key values of course. Are they larger or smaller than your surrogate values.

... and if you use data segment compression and the rows in the table are reasonably ordered by the key column(s) then the relative length becomes practically insignificant.

Utkal Ranjan, March 16, 2012 - 1:48 am UTC

Hi tom,

please explain me what is use of surrogate key? and how it differs from natural key.

Thanks

Utkal
Tom Kyte
March 16, 2012 - 8:38 am UTC

A surrogate key is used when no natural key exists. It is a 'substitute key'.

For example, given a Human Resources system - you might have an EMP table.

What is the natural key of an employee. It cannot be their name, that is not unique. It cannot be their Social Security number, that can change. There is in fact no true 'natural key'.

So, we assign them a surrogate key - an artificially generated EMPNO - employee number.

immutable Primary key or immutable Referenced Columns ?

Franck Pachot, March 16, 2012 - 11:46 am UTC

Hi Tom,

Debates about surrogate vs. natural are always talking about primary keys, and the fact that they should be immutable.

But the real problem is not because it is a primary key. The problem comes from the fact that those key columns are referenced by a foreign key. And we want to avoid cascading updates issue.

But you can reference columns that are not the primary key. You only need them to be unique.
And you can have a primary key that is not referenced by anything. And that one can be mutable.

As far a I know, being the *primary* key or just a unique/notnull constraint has a consequence only when we have an IOT. 'Primary' means that it is used to physically organize rows. And it has no consequence about referential integrity. Am I right ?

Regards,
Franck.
Tom Kyte
March 16, 2012 - 12:10 pm UTC

there is, quite simply, nothing to debate:


If (the key is not immutable)
THEN
   that is not the key, it is maybe UNIQUE, but it sure isn't the KEY!!!
END IF




If what you thought was your natural key was in fact something that was going to be updated - then it is not your natural key and you might well need a surrogate key. Just like my social security number example above.

Does this hold good for Oracle as well

A reader, February 04, 2013 - 11:59 am UTC

Tom Kyte
February 06, 2013 - 7:21 am UTC

well, i had to laugh at this:

...
Globally unique identifiers are known to work well in the source OLTP systems, but they are difficult to use when it comes to data warehouses. This is primarily because of two reasons:

Globally unique identifiers use a significant amount of space compared to their integer counterparts. Globally unique identifiers take about 16 bytes each, where an integer takes about 4 bytes.

Indexes on globally unique identifiers columns are relatively slower than indexes on integer keys because globally unique identifiers are four times larger.Globally unique identifiers use a significant amount of space compared to their integer counterparts. Globally unique identifiers take about 16 bytes each, where an integer takes about 4 bytes.
....


first, in todays data warehouses - billions of rows are standard, we have warehouses that add billions of rows per week or month. Think about what happens if you use a 4 byte integer as your surrogate. Whoops - we can only load around 2-4 billion records before we die. What about the system that needs to load 8 billion rows a month (or more - things are growing, not shrinking). Using a 4 byte integer is just silly today.

Also to say an index would be slower - not really - the speed of an index is relative to its height (number of IOs from top to leave), not the width of the key. And height is not directly relative to the width of the key.

assuming you use some reasonable partitioning scheme, the height isn't going to be significantly different - if at all:



ops$tkyte%ORA11GR2> create table t ( x number not null, y raw(16) not null );

Table created.

ops$tkyte%ORA11GR2> create table t2 segment creation immediate as select * from index_stats;

Table created.

ops$tkyte%ORA11GR2> create sequence s cache 100000;

Sequence created.

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> create index t1_idx on t(x);

Index created.

ops$tkyte%ORA11GR2> create index t2_idx on t(y);

Index created.

ops$tkyte%ORA11GR2> pause

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> begin
  2          loop
  3                  insert /*+ append */ into t select s.nextval, sys_guid() from dual connect by level <= 100000;
  4                  commit;
  5          end loop;
  6  end loop;
  7  /
begin
*
ERROR at line 1:
ORA-01654: unable to extend index OPS$TKYTE.T1_IDX by 1024 in tablespace USERS
ORA-06512: at line 3


ops$tkyte%ORA11GR2> select count(*) from t;

  COUNT(*)
----------
  24900000

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> analyze index t1_idx validate structure;

Index analyzed.

ops$tkyte%ORA11GR2> insert into t2 select * from index_stats;

1 row created.

ops$tkyte%ORA11GR2> analyze index t2_idx validate structure;

Index analyzed.

ops$tkyte%ORA11GR2> insert into t2 select * from index_stats;

1 row created.

ops$tkyte%ORA11GR2> select name, height, blocks, lf_rows from t2;

Tablespace Name                    HEIGHT     BLOCKS    LF_ROWS
------------------------------ ---------- ---------- ----------
T1_IDX                                  3      53248   24900000
T2_IDX                                  3      89088   24900000


it will take the same amount of work to retrieve a key from either of those indexes - yes, one takes more space but the performance will be just about the same. And even if the height of the GUID index was say 3 times higher - it still wouldn't matter practically. Unless you use the index to get a *single row* (not as typical in a warehouse as in OLTP) - most of the IO's are against the table. Say you used the index to get 100 rows - then the index with height 3 would do about 103 IO's where as the other would do about 109 - big deal.


Their first reason for not using a natural key:
"Data tables in various OLTP source systems may use different keys for the same entity"

is a data model issue, they are basically saying "for this new system we are building (the consolidated system) we don't have a natural key". Ok, if you don't have a natural key for a set of data - go for a surrogate, no one is arguing that, makes sense.



then they say "Surrogate keys provide the means to maintain data warehouse information when dimensions change"

change surrogate to natural and the sentence is still true. You can use a natural key (assuming one exists). Surrogates are NOT NECESSARY as they state (without any evidence). If you have a natural key - it has all of the attributes of a surrogate, so if a surrogate works, a natural key by definition works! A surrogate cannot be necessary if a natural exists.


then they say "Natural OLTP system keys may change or be reused in the source data systems"

well, I'm sorry but if the natural key changes, it wasn't a natural key by definition. Move along, nothing to see here.



they they say "Improve performance of queries"

without a single number to prove it. Bear in mind, Oracle doesn't have any 4 byte integer types (thankfully). Our numbers are packed decimal like and range in length from 0 to 22 bytes. Not true.


This is just gibberish - I don't understand the context:

Handle exception cases
If requirements need to be determined or do not apply, use a surrogate key.


????? it seems to say "if you do not know what you are doing, use a surrogate", that cannot be right.

unknown

unknow, September 11, 2013 - 10:41 am UTC

Hello tom,
My boss told me to create the table with surrogate key all we are new in data warehouse project but they are not going to give any kind of training stuff...
what is the syntax to create table with this " SK "
please help me....

DW_CUSTOMER
DW_CUSTOMER_ADDRESS and so on.i have so many tables that should be joined by customer_id.
Please provide me syntax of this


Tom Kyte
September 23, 2013 - 5:41 pm UTC

really?

a surrogate key is just a unique key you make up yourself.

a sequence is many times used for this. insert sequence.nextval