well, i had to laugh at this:
...
Globally unique identifiers are known to work well in the source OLTP systems, but they are difficult to use when it comes to data warehouses. This is primarily because of two reasons:
Globally unique identifiers use a significant amount of space compared to their integer counterparts. Globally unique identifiers take about 16 bytes each, where an integer takes about 4 bytes.
Indexes on globally unique identifiers columns are relatively slower than indexes on integer keys because globally unique identifiers are four times larger.Globally unique identifiers use a significant amount of space compared to their integer counterparts. Globally unique identifiers take about 16 bytes each, where an integer takes about 4 bytes.
....
first, in todays data warehouses - billions of rows are standard, we have warehouses that add billions of rows per week or month. Think about what happens if you use a 4 byte integer as your surrogate. Whoops - we can only load around 2-4 billion records before we die. What about the system that needs to load 8 billion rows a month (or more - things are growing, not shrinking). Using a 4 byte integer is just silly today.
Also to say an index would be slower - not really - the speed of an index is relative to its height (number of IOs from top to leave), not the width of the key. And height is not directly relative to the width of the key.
assuming you use some reasonable partitioning scheme, the height isn't going to be significantly different - if at all:
ops$tkyte%ORA11GR2> create table t ( x number not null, y raw(16) not null );
Table created.
ops$tkyte%ORA11GR2> create table t2 segment creation immediate as select * from index_stats;
Table created.
ops$tkyte%ORA11GR2> create sequence s cache 100000;
Sequence created.
ops$tkyte%ORA11GR2>
ops$tkyte%ORA11GR2> create index t1_idx on t(x);
Index created.
ops$tkyte%ORA11GR2> create index t2_idx on t(y);
Index created.
ops$tkyte%ORA11GR2> pause
ops$tkyte%ORA11GR2>
ops$tkyte%ORA11GR2> begin
2 loop
3 insert /*+ append */ into t select s.nextval, sys_guid() from dual connect by level <= 100000;
4 commit;
5 end loop;
6 end loop;
7 /
begin
*
ERROR at line 1:
ORA-01654: unable to extend index OPS$TKYTE.T1_IDX by 1024 in tablespace USERS
ORA-06512: at line 3
ops$tkyte%ORA11GR2> select count(*) from t;
COUNT(*)
----------
24900000
ops$tkyte%ORA11GR2>
ops$tkyte%ORA11GR2> analyze index t1_idx validate structure;
Index analyzed.
ops$tkyte%ORA11GR2> insert into t2 select * from index_stats;
1 row created.
ops$tkyte%ORA11GR2> analyze index t2_idx validate structure;
Index analyzed.
ops$tkyte%ORA11GR2> insert into t2 select * from index_stats;
1 row created.
ops$tkyte%ORA11GR2> select name, height, blocks, lf_rows from t2;
Tablespace Name HEIGHT BLOCKS LF_ROWS
------------------------------ ---------- ---------- ----------
T1_IDX 3 53248 24900000
T2_IDX 3 89088 24900000
it will take the same amount of work to retrieve a key from either of those indexes - yes, one takes more space but the performance will be just about the same. And even if the height of the GUID index was say 3 times higher - it still wouldn't matter practically. Unless you use the index to get a *single row* (not as typical in a warehouse as in OLTP) - most of the IO's are against the table. Say you used the index to get 100 rows - then the index with height 3 would do about 103 IO's where as the other would do about 109 - big deal.
Their first reason for not using a natural key:
"Data tables in various OLTP source systems may use different keys for the same entity"
is a data model issue, they are basically saying "for this new system we are building (the consolidated system) we don't have a natural key". Ok, if you don't have a natural key for a set of data - go for a surrogate, no one is arguing that, makes sense.
then they say "Surrogate keys provide the means to maintain data warehouse information when dimensions change"
change surrogate to natural and the sentence is still true. You can use a natural key (assuming one exists). Surrogates are NOT NECESSARY as they state (without any evidence). If you have a natural key - it has all of the attributes of a surrogate, so if a surrogate works, a natural key by definition works! A surrogate cannot be necessary if a natural exists.
then they say "Natural OLTP system keys may change or be reused in the source data systems"
well, I'm sorry but if the natural key changes, it wasn't a natural key by definition. Move along, nothing to see here.
they they say "Improve performance of queries"
without a single number to prove it. Bear in mind, Oracle doesn't have any 4 byte integer types (thankfully). Our numbers are packed decimal like and range in length from 0 to 22 bytes. Not true.
This is just gibberish - I don't understand the context:
Handle exception cases
If requirements need to be determined or do not apply, use a surrogate key.????? it seems to say "if you do not know what you are doing, use a surrogate", that cannot be right.