Slowly changing dimensions type 2
A reader, February  15, 2004 - 2:57 pm UTC
 
 
Pradhan,
You mention "Type 2 Dimension" -- do you mean slowly changing dimensions type 2 (SCD 2)?
A short description of SCD2:
Using Type 2, a new record with the new attributes is added to the dimension table. Historical fact table rows continue to reference the old dimension key with the old roll-up attribute; going forward, the fact table rows will reference the new surrogate key with the new roll-up thereby perfectly partitioning history (from www.rkimball.com/html/designtips/2000/designtip15.html) 
So with SCD 2 you would keep the entire history of your dimension in the dimension. Fact data from August 2000 will forever point to dimension attributes which were valid in August 2000, and fact data from May 2003 will forever point to dimension attributes valid in May 2003. It's possible that the attribute didn't change between August 2000 and May 2003, in which case it would be the same dimension record; but if something changed, it would be two different dimension records.
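For illustration only (surrogate keys, column names and dates made up), a customer whose region changed mid-2002 would have two rows in the dimension:
CUST_KEY  CUST_ID  REGION  VALID_FROM   VALID_TO
--------  -------  ------  -----------  -----------
    1001       42  EAST    01-JAN-2000  31-MAY-2002
    2347       42  WEST    01-JUN-2002  31-DEC-9999
Facts loaded in August 2000 carry surrogate key 1001 forever; facts loaded in May 2003 carry 2347.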
So if you are using SCD 2, why would you move your old row to a history table? When a query is executed against your mart, how should the DB know when it has to look in the history table and when in the actual dimension?
Regards
Maurice
 
 
 
alternative
Ferenc Palyi, February  16, 2004 - 4:31 am UTC
 
 
(E.f1 <> S.f1 or (E.f1 is null and S.f1 is not null) 
                      or (E.f1 is not null and S.f1 is null))
An alternative for the above statement:
decode(e.f1, s.f1, 0, 1) = 1 
 
February  16, 2004 - 7:38 am UTC 
 
you'll find the original to be marginally faster -- i prefer the verbose approach here. 
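For context, both predicates flag a difference in f1 while treating NULL = NULL as "no change". A minimal sketch of the decode form in use (EXISTING_TABLE, STAGE_TABLE and the join key PK are hypothetical names; E and S follow the aliases in the post above):
select s.*
  from existing_table e, stage_table s
 where e.pk = s.pk
   -- decode treats two NULLs as a match, so this is a null-safe comparison
   and decode(e.f1, s.f1, 0, 1) = 1;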
 
 
 
But if you can't replace the table
Peter, February  16, 2004 - 8:57 am UTC
 
 
If you use materialized views and have a materialized view log on the dimension table, you may not be able to replace the dimension table without triggering a large amount of MV changes.
You could consider using a set-based check on the loaded table to ignore identical rows. As you state you are on Oracle 8.1.7, you will not have access to MERGE (a shame, as that alone is well worth the move to 9.2), so you will need to process the updates and inserts separately.
Steps to follow:
1) load the data file to a STAGE area
2) insert into a CHANGES table all the rows that are NOT identical to those in the DIMENSION table (use something like Select * from STAGE MINUS select * from DIMENSION).
3) update all rows that exist in both the DIMENSION and CHANGES tables
4) insert all the rows from the CHANGES table that do not exist in the DIMENSION.
For simplicity I have put the updates first, then the inserts (a sketch follows below) - you may find that for efficiency you should go insert then update. But if you had Oracle 9 you could use MERGE, as it is 1) set based and 2) cool.
For one of our 8i data warehouses we attempt to update all the DIMENSION records we have a changed value for and use exception handling to process the inserts when we get a missing dimension value. Our code for this is row by row using a cursor, but it performs quite well.
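A minimal sketch of steps 2-4 (STAGE, CHANGES and DIMENSION as above; the key column KEY and the attribute columns ATTR1, ATTR2 are placeholders):
insert into changes
select * from stage
minus
select * from dimension;

-- step 3: update rows present in both DIMENSION and CHANGES
update dimension d
   set (attr1, attr2) = ( select c.attr1, c.attr2
                            from changes c
                           where c.key = d.key )
 where exists ( select null from changes c where c.key = d.key );

-- step 4: insert rows present only in CHANGES
insert into dimension
select *
  from changes c
 where not exists ( select null from dimension d where d.key = c.key );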
 
February  16, 2004 - 9:12 am UTC 
 
they made no mention whatsoever of other stuff.  
If they come up with other constraints, I'd be glad to address them... 
 
 
 
 
Type 2 Dimension
A reader, February  17, 2004 - 4:25 pm UTC
 
 
Thank you very much.
 
 
Rapidly changing monster dimensions
A reader, September 21, 2004 - 8:04 pm UTC
 
 
I don't want to get bogged down in semantics or buzzwords here about how Kimball defined things, but we have the following problem:
We have a million-row table containing a bunch of numbers, and we wish to provide access to 30 days of this data. Only about 5% of the data changes from day to day.
Without sacrificing query performance, how can I provide the logical equivalent of having a 30-million-row table but store only the delta from day to day?
[The textbook technique for handling such rapidly changing monster dimensions suggests creating another table with the attributes that change less frequently and creating "bands" from them. That's not an option here, since all my numeric values are continuous and cannot be banded into discrete ranges.]
Help? Thanks 
 
 
Is this approach right?
A reader, June      20, 2006 - 9:40 am UTC
 
 
Hi,
We have a set of PL/SQL procedures which do the ETL process for our data warehouse. Is it a good design to have a table which lists all the available procedures, their active/inactive states and the sequence of execution? I can then have a main package which will call the procedures to be executed in the sequence defined. This way any new procedures introduced at a later point can be easily plugged into the ETL process.
Is this approach correct or is there a better way of doing this? Please suggest.
Thanks in advance.
 
 
June      20, 2006 - 10:13 am UTC 
 
depends on how frequently this is executed.  dynamic sql is "neat" but generic is typically less efficient.  but if this is infrequently executed - probably "ok" in this case. 
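For illustration, a minimal sketch of the control-table idea (all names are hypothetical):
create table etl_steps
( seq       number primary key,
  proc_name varchar2(30) not null,
  active    varchar2(1)  default 'Y' not null );

begin
  for r in ( select proc_name
               from etl_steps
              where active = 'Y'
              order by seq )
  loop
    -- run each registered procedure in sequence via dynamic sql;
    -- proc_name comes from a table only the ETL team can modify
    execute immediate 'begin ' || r.proc_name || '; end;';
  end loop;
end;
/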
 
 
 
Best approach to update a SCD Type 2 table ? 
Philippe, September 14, 2006 - 2:46 am UTC
 
 
Hi Tom,
We have a data warehouse in an Oracle 10.2 database... and we need to update SCD Type 2 dimension tables.
The algorithm to update the table is:
FOR ALL ROWs in the SOURCE query LOOP
  IF ( a corresponding row exists in the DIM TABLE )
  THEN
    IF data are different THEN
      1. UPDATE the "CURRENT" row of the DIM TABLE
           SET VALID_TO = NOW,
               CURRENT = 'N'
      2. INSERT new version into DIM TABLE
    END IF
  ELSE
    INSERT new row into DIM TABLE
  END IF
END LOOP
I don't think the algorithm you suggest in your first follow up is appropriate to our situation.
Of course, implementing this row by row in PL/SQL is not the  fastest approach.
I see two approaches :
  - express this in a single MERGE sql statement (but I don't see how)
  - use PL/SQL FETCH BULK COLLECT...
Which is the best approach ?
Is there another way to do this ?
NB : - at the first run, all rows will be NEW => inserted.
     - for the next runs, only a few lines will have to be inserted and updated (less than 5% of the DIM table)
Thanks in advance for your advice !
Philippe 
 
September 14, 2006 - 9:10 am UTC 
 
or a single update statement to retire existing rows and a single insert to insert new rows. 
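A sketch of that two-statement approach, with made-up names (dimension DIM with natural key NK, one tracked attribute ATTR, source SRC, and a sequence DIM_SEQ for the surrogate key):
-- 1) retire the current rows whose attributes changed
update dim d
   set d.valid_to = sysdate,
       d.current_flag = 'N'
 where d.current_flag = 'Y'
   and exists ( select null
                  from src s
                 where s.nk = d.nk
                   and decode(s.attr, d.attr, 0, 1) = 1 );

-- 2) insert new versions for changed rows and brand-new rows;
--    after step 1 neither group has a current row anymore
insert into dim (dim_key, nk, attr, valid_from, valid_to, current_flag)
select dim_seq.nextval, s.nk, s.attr, sysdate, null, 'Y'
  from src s
 where not exists ( select null
                      from dim d
                     where d.nk = s.nk
                       and d.current_flag = 'Y' );
Run in that order: the update must retire the changed rows first so the insert's NOT EXISTS picks them up.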
 
 
 
Best approach to update a SCD Type 2 table ?
A reader, September 14, 2006 - 9:32 am UTC
 
 
If the "source query" is a complex view on a remote database... which of the approaches would be the best one.
In your suggested one, this view would have to be queried twice ...
 
 
September 14, 2006 - 9:55 am UTC 
 
insert into gtt 
select complex query;
do update from gtt;
do insert from gtt;
is one approach. 
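A minimal sketch of that (names hypothetical; the GTT is created once, not per load):
create global temporary table gtt
( nk   number,
  attr varchar2(30) )
on commit delete rows;

insert into gtt
select nk, attr
  from complex_view@remote_db;  -- the expensive query runs only once

-- then drive the retire-update and the insert (as in the previous
-- follow-up) from GTT instead of the remote view; the final commit
-- also empties the GTT.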
 
 
 
how to do this using merge
Reene, December  24, 2007 - 3:41 am UTC
 
 
drop table t1;
drop table t2;
create table t1 as select object_id,object_name,trunc(sysdate)-1 created,trunc(sysdate)-1 updated from all_objects where rownum < 5;
create table t2 as select * from t1 ;
exec dbms_stats.gather_table_stats(user,'T1');
exec dbms_stats.gather_table_stats(user,'T2');
insert into t2 values (-1,'TEST',trunc(sysdate),trunc(sysdate)) ;
 
Now I want to merge t2 into t1 using a merge statement.
But I have a slightly different requirement - 
1. if an object_id exists in t2 (e.g. all rows except the one with object_id = -1) then I should update
the existing rows with updated = trunc(sysdate)
and then insert the same row (i.e. the row which I am going to update) with updated = to_date('01-01-2199','mm-dd-yyyy');
so the expected output data in t1 is
 OBJECT_ID OBJECT_NAME                    CREATED         UPDATED
---------- ------------------------------ --------------- ---------------
        17 FILE$                          23-DEC-07       01-JAN-2199
        51 BOOTSTRAP$                     23-DEC-07       01-JAN-2199
        25 I_COBJ#                        23-DEC-07       01-JAN-2199
        44 I_CON2                         23-DEC-07       01-JAN-2199
        17 FILE$                          23-DEC-07       24-DEC-2007
        51 BOOTSTRAP$                     23-DEC-07       24-DEC-2007
        25 I_COBJ#                        23-DEC-07       24-DEC-2007
        44 I_CON2                         23-DEC-07       24-DEC-2007
        -1 TEST                           24-DEC-07       24-DEC-2007
I know this merge statement will update or insert but I need
update and insert (if matched)
or 
insert  (if not matched)
merge into t1
using (select * from t2) t
on (t1.object_id = t.object_id)
when matched then
  update set t1.object_name = t.object_name
when not matched then
  insert (object_id, object_name, created, updated)
  values (t.object_id, t.object_name, t.created, t.updated)
/
appreciate your help. 
December  24, 2007 - 9:17 am UTC 
 
you would not be using merge to update using a record and insert that record.
that is not what merge does. 
 
 
is there another way to do this
Reene, December  24, 2007 - 9:23 am UTC
 
 
Hi Tom
yes, merge doesn't work like this, but is there another better way to do it using a single SQL statement, not PL/SQL?
Thanks 
December  24, 2007 - 9:28 am UTC 
 
merge is the only thing that can insert OR update.
but it won't do both with the same source row.
insert inserts, update updates, delete deletes, merge merges - that is all...
you will update
and then insert
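Against the t1/t2 test case above, a sketch of that (the far-future date literal comes from the expected output):
-- stamp the rows that exist in both tables
update t1
   set updated = trunc(sysdate)
 where object_id in ( select object_id from t2 );

-- add the open-ended 01-JAN-2199 version of each matched row
insert into t1 (object_id, object_name, created, updated)
select object_id, object_name, created,
       to_date('01-01-2199','mm-dd-yyyy')
  from t2
 where object_id in ( select object_id from t1 );

-- add the rows that are only in t2 (object_id = -1) as they are
insert into t1
select *
  from t2
 where object_id not in ( select object_id from t1 );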
 
 
 
Zafar, May       22, 2009 - 4:03 am UTC
 
 
Hi Tom,
I am building ETL for a SCD Type 2 dimension with an approach similar to the one you stated (first update to expire the existing rows in the dimension, then merge; using MERGE instead of INSERT is appropriate in my particular situation).
The only concern I have is that these two operations might not act in unison, i.e. if the update executes and the merge/insert fails, there will be data issues in the target table. Any suggestion on how to handle this? Also, what can be a suggested recovery mechanism in case an error occurs during one of the steps?
May       23, 2009 - 10:18 pm UTC 
 
we call this "a transaction"
a series of SQL statements either entirely succeed or entirely fail.
so, update to expire
then merge to refresh data
THEN COMMIT <<<=== this makes both of the prior statements permanent
if error then ROLLBACK <<<=== this makes any work of the above two statements go away.
Truth be told, if you have to ask this, you have a lot of reading to do before you can safely touch the database - suggest you start with the concepts guide and read it from start to finish.  Don't take that wrong, but transactions are sort of the core foundation upon which everything else is built - if you didn't know about them, you have some learning to do before you start programming production applications. 
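A skeletal version of that transaction (EXPIRE_CHANGED_ROWS and MERGE_NEW_VERSIONS are hypothetical procedures wrapping your two statements):
begin
  expire_changed_rows;  -- the UPDATE that closes out changed rows
  merge_new_versions;   -- the MERGE that brings in the new data
  commit;               -- both statements become permanent together
exception
  when others then
    rollback;           -- the work of both statements goes away
    raise;              -- re-raise so the failure is not hidden
end;
/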
 
 
Dimensional Indexes
A reader, June      07, 2009 - 5:44 am UTC
 
 
Hi Tom,
Thanks for suggesting the approach of merging the delta with the original dimension.
We have a dimension which is partitioned by is_current_flag: history (type 2) is maintained in one partition and the current, ACTIVE records are in another.
On a daily basis we receive 1.5 million rows as the delta, of which 20% are new and the rest are type 1 or type 2 changes.
The delta records have to be merged with 304 million records and appropriately sit in either the historical or the current partition.
For updates to type 2 we generate a new surrogate key. We have bitmap indexes on 8 columns for this dimension.
Currently we create another TEMP table from the current dimension, apply the new inserts and merges as transformations, and build all the bitmap indexes on the TEMP table.
The last step is to do an EXCHANGE PARTITION with the existing dimension. However, the time spent creating the indexes is about 4 hours, which is too much.
I was thinking of applying a merge statement to the existing dimension, as the delta set is small. Does an update of 1.5 million rows with bitmap indexes on 8 columns look like a feasible solution, or could it make things worse?
What other solutions can you propose? Appreciate your response.
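For reference, the exchange step is essentially a single dictionary operation along these lines (names hypothetical):
alter table dim_table
  exchange partition p_current
  with table temp_dim
  including indexes
  without validation;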
 
Dimensional indexes
A reader, June      09, 2009 - 11:40 pm UTC
 
 
Hi Tom,
Can you please answer the above question?
June      10, 2009 - 11:56 am UTC 
 
didn't really have anything to say, so I didn't.
Ok, i'll say one thing:
... I was thinking of applying a merge statement to the existing dimension, as the delta set is small. 
Does an update of 1.5 million rows with bitmap indexes on 8 columns look like a feasible solution, 
or could it make things worse?
 ...
in a word: benchmark.
if you do it in bulk, it can be very efficient.  if you do it slow by slow, it'll kill you.
but in the end, you have to benchmark it. 
 
 
Dimensional indexes
A reader, June      10, 2009 - 12:03 pm UTC
 
 
Thanks Tom,
Merge, being a bulk operation and not a row-by-row approach, will be benchmarked.
One question though: will the indexes be updated in bulk after the operation is complete, or will index maintenance happen for every row that touches a data block?
Thanks 
June      10, 2009 - 3:01 pm UTC 
 
we attempt to manage bitmap updates in bulk during that single sql statement, yes. 
 
 
SCD Type 2 implementation
Anamika, September 10, 2009 - 8:06 am UTC
 
 
Hi Tom,
I need to use the SCD Type 2 concept for a table that gets new records, as well as modified records, once in a while. In order to handle this, do I have to update the date field effective_thru for the existing records and insert the new records as they are?
Can you help me write the right code for this, as to how I can handle it?
 
September 14, 2009 - 11:12 am UTC 
 
update t set effective_thru = sysdate
where key in (select key from new_inputs);
insert into t ... select from new_inputs;
 
 
 
A reader, September 10, 2009 - 8:09 am UTC
 
 
Sorry, I need it to be written in PL/SQL code.
September 14, 2009 - 11:13 am UTC 
 
ummm
NO
YOU 
DON'T
there is no procedural code necessary, desired or otherwise appropriate here. 
 
 
My Solution for implementing SCD-2 using PL/SQL
Ranjeet Srivastava, September 20, 2009 - 1:11 am UTC
 
 
My approach here will be to help you create a re-usable common component to achieve an SCD-2 implementation using 100% SQL and PL/SQL.
Step#1. First of all, bring the DELTA from your source (I can discuss the ETL strategy for incremental data loads in a separate session if you need/want it).
Step#2. Create a control_table(Source_System_ID, Source_Table_Name, Source_Column_Name, SCD_Type).
Step#3. List all the source tables that need SCD-2 or SCD-3. This is a one-time task that makes it a re-usable component; here you may need to consult your business people to freeze the list.
Step#4. Design your target data warehouse table with the following additional columns (Effective_From, Effective_To, Active_Flag).
Step#5. For every delta, based on the DML type (insert, update or delete -- generally a soft delete becomes an update), check its SCD qualification in the control_table (as created above).
Step#6. If it qualifies for SCD-2, first go to the target, update the Effective_To column to the current date and change the Active_Flag to '0' ('0' means inactive, '1' means active).
Step#7. Insert the newly updated record with Effective_From = current date and Effective_To = either null or some distant future date like 31-12-9999. At the same time, set the Active_Flag to '1'.
For all these purposes you can use a stored procedure written in PL/SQL.
Bulk collect can help you achieve this faster (set the LIMIT based on your requirement); see the sketch at the end of this post.
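For illustration, a possible shape for the control table of Step#2 (a sketch; column sizes are arbitrary):
create table control_table
( source_system_id   number       not null,
  source_table_name  varchar2(30) not null,
  source_column_name varchar2(30) not null,
  scd_type           number(1)    not null check (scd_type in (1,2,3)),
  constraint control_table_pk
    primary key (source_system_id, source_table_name, source_column_name) );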
If you don't require the incremental data load approach, then you can follow this algorithm:
The algorithm to update the dimension table (having a small number of records -- a few hundred) in SCD-2 will be:
FOR ALL ROWs in the SOURCE query LOOP
  IF ( a corresponding row exists in the DIM TABLE )
  THEN
    IF data are different THEN
      1. UPDATE the "CURRENT" row of the DIM TABLE
           SET VALID_TO = NOW,
               CURRENT = 'N'
      2. INSERT new version into DIM TABLE
    END IF
  ELSE
    INSERT new row into DIM TABLE
  END IF
END LOOP
NB: You may use bulk collect to achieve this faster in Oracle.
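A skeletal BULK COLLECT version of that loop (all names hypothetical; it assumes the delta already contains only new or changed rows, and error handling is omitted):
declare
  cursor c is select nk, attr from delta_stage;
  type t_nk_tab   is table of number;
  type t_attr_tab is table of varchar2(30);
  l_nk   t_nk_tab;
  l_attr t_attr_tab;
begin
  open c;
  loop
    fetch c bulk collect into l_nk, l_attr limit 500;
    exit when l_nk.count = 0;

    -- expire the currently active versions of the incoming keys
    forall i in 1 .. l_nk.count
      update dw_target
         set effective_to = sysdate,
             active_flag  = '0'
       where nk = l_nk(i)
         and active_flag = '1';

    -- insert the new active versions
    forall i in 1 .. l_nk.count
      insert into dw_target (nk, attr, effective_from, effective_to, active_flag)
      values (l_nk(i), l_attr(i), sysdate,
              to_date('31-12-9999','dd-mm-yyyy'), '1');
  end loop;
  close c;
  commit;
end;
/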
 
Dimensional Query
Bhushan, October   12, 2009 - 5:22 pm UTC
 
 
Dear Thomas,
Very good information as usual. I need your suggestion/advice on the following. I know the best way would be benchmarking, but any leads on other alternatives will be helpful.
1) Table Structure
create table BRS_TRAN_DTL
(
  DEFAULTCODE           VARCHAR2(10) not null,
  TRAN_NO               VARCHAR2(12) not null,
  TRAN_CODE             VARCHAR2(10) not null,
  TRAN_DATE             DATE not null,
  SEQ                   NUMBER not null,
  COMPANY               VARCHAR2(10) not null,
  UNIT                  VARCHAR2(10),
  LOCATION              VARCHAR2(10),
  PARTY                 VARCHAR2(20),
  PARTY_TEXT            VARCHAR2(50),
  INV_NO                VARCHAR2(12),
  INV_DATE              DATE,
  PRD_GROUP             VARCHAR2(20) not null,
  PRODUCT               VARCHAR2(20) not null,
  PRODUCT_DTL           VARCHAR2(50) not null,
  PRODUCT_DTL_TEXT      VARCHAR2(1000),
  PRICE                 NUMBER,
  QTY                   NUMBER,
  UM                    VARCHAR2(10),
  AMOUNT_INV            NUMBER,
  BASE_CURRENCY         VARCHAR2(10),
  TRAN_CURRENCY         VARCHAR2(10),
  BASE_AMOUNT           NUMBER,
  TRAN_AMOUNT           NUMBER,
  ACCT_BALANCE          NUMBER,
  EXCH_RATE_TRAN_2_BASE NUMBER,
  VAT_AMT               NUMBER,
  WAYBILL_NO            VARCHAR2(10),
  BATCH_NO              VARCHAR2(10),
  SETTLED_YN            VARCHAR2(1),
  COMENT                VARCHAR2(120),
  STORE                 VARCHAR2(10),
  CREATED_BY            VARCHAR2(30),
  CREATED_DATE          DATE,
  SEALED_YN             VARCHAR2(1),
  DISCOUNT_PERCENT      NUMBER,
  TRAN_EFFECT           VARCHAR2(1) not null,
  MODIFIED_BY           VARCHAR2(30),
  MODIFIED_DATE         DATE,
  MAJOR_GROUP           VARCHAR2(20),
  STOCK_YN              VARCHAR2(1),
  DTL_REQ_YN            VARCHAR2(1),
  CSV_STRING            VARCHAR2(2000) not null,
  TOUNIT                VARCHAR2(10),
  FROMUNIT              VARCHAR2(10),
  TRAN_REF_NO           VARCHAR2(12),
  STN_QTY               NUMBER,
  CSV_SEQ               NUMBER,
  TRAN_EXCH_AMOUNT      NUMBER,
  BASE_EXCH_AMOUNT      NUMBER,
  BASE_PRICE            NUMBER,
  CSV_VALUE             NUMBER,
  STN_VALUE             NUMBER,
  CONS_STRING           VARCHAR2(2000),
  SUBGROUP              VARCHAR2(20),
  PERIOD                DATE,
  PU_DISCOUNT           NUMBER,
  EXPIRY_DATE           DATE,
  TRANSPORT_COST        NUMBER,
  EXP_STRING            VARCHAR2(100),
  CSV_STRING_BATCH      VARCHAR2(2000),
  CSV_VALUE_BATCH       NUMBER
)
Number of rows > 75000
For some reason the developers thought creating more indexes was going to help, so they created more than 15 indexes on various columns. Let us ignore them for now.
To continue: if on a particular date we need to calculate the OPENING STOCK and current STOCK of various products, do you think this is a good approach?
SELECT SUM(op_stk), SUM(curr_stk), SUM(period_stk),
       product,
       defaultcode
FROM (
  SELECT
    CASE WHEN trunc(tran_date) <= to_date('30-jun-2008','DD-MON-RRRR')
         THEN decode(tran_effect,'R',qty,'S',-qty) END op_stk,
    CASE WHEN trunc(tran_date) >  to_date('30-jun-2008','DD-MON-RRRR')
         THEN decode(tran_effect,'R',qty,'S',-qty) END period_stk,
    decode(tran_effect,'R',qty,'S',-qty) curr_stk,
    product,
    defaultcode
  FROM brs_tran_dtl )
GROUP BY product,
         defaultcode
Any other way of getting this result will be helpful.
Kind Regards,
Bhushan 
 
Above Post
Bhushan, November  05, 2009 - 9:26 am UTC
 
 
Thanks for ignoring; I will come back with more details :)