Slowly changing dimensions type 2
A reader, February 15, 2004 - 2:57 pm UTC
Pradhan,
You mention "Type 2 Dimension"; do you mean slowly changing dimensions type 2 (SCD 2)?
A short description of SCD2:
Using Type 2, a new record with the new attributes is added to the dimension table. Historical fact table rows continue to reference the old dimension key with the old roll-up attribute; going forward, the fact table rows will reference the new surrogate key with the new roll-up thereby perfectly partitioning history (from www.rkimball.com/html/designtips/2000/designtip15.html)
So with SCD 2 you keep the entire history of your dimension in the dimension table. Fact data from August 2000 will forever point to the dimension attributes that were valid in August 2000, and fact data from May 2003 will forever point to the dimension attributes valid in May 2003. It's possible that the attribute didn't change between August 2000 and May 2003, in which case it would be the same dimension record; but if something changed, it would be two different dimension records.
So if you are using SCD 2, why would you move your old row to a history table? When a query is executed against your mart, how should the DB know when it has to look in the history table and when in the actual dimension?
Regards
Maurice
alternative
Ferenc Palyi, February 16, 2004 - 4:31 am UTC
(E.f1 <> S.f1 or (E.f1 is null and S.f1 is not null)
or (E.f1 is not null and S.f1 is null))
An alternative for the above statement:
decode(e.f1, s.f1, 0, 1) = 1
February 16, 2004 - 7:38 am UTC
you'll find the original to be marginally faster -- i prefer the verbose approach here.
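For context, a self-contained sketch of the two equivalent null-safe change tests (the table names and the join key are made up; E/S and f1 mirror the snippets above):

```sql
-- verbose form: flags a change when values differ,
-- including NULL -> value and value -> NULL transitions
select e.pk
  from existing_rows e, staged_rows s
 where e.pk = s.pk
   and (    e.f1 <> s.f1
        or (e.f1 is     null and s.f1 is not null)
        or (e.f1 is not null and s.f1 is     null));

-- DECODE form: DECODE treats two NULLs as equal,
-- so this single predicate flags exactly the same rows
select e.pk
  from existing_rows e, staged_rows s
 where e.pk = s.pk
   and decode(e.f1, s.f1, 0, 1) = 1;
```

The DECODE form is terser, but as noted above the verbose predicate can be marginally faster and states its intent explicitly.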
But if you can't replace the table
Peter, February 16, 2004 - 8:57 am UTC
If you use materialized views and have materialized view logs on the dimension table, you may not be able to replace the dimension table without triggering a large amount of MV changes.
You could consider using a set-based check on the loaded table to ignore identical rows. As you state you are on Oracle 8.1.7, you will not have access to MERGE (a shame, as that alone is well worth the move to 9.2), so you will need to process the updates and inserts separately.
Steps to follow:
1) load the data file to a STAGE area
2) insert into a CHANGES table all the rows that are NOT identical to those in the DIMENSION table (use something like Select * from STAGE MINUS select * from DIMENSION).
3) update all rows that exist in both the DIMENSION and CHANGES tables
4) insert all the rows from the CHANGES table that do not exist in the DIMENSION.
For simplicity I have put the updates first, then the inserts - you may find that for efficiency you should go insert then update. But if you had Oracle 9 then use MERGE, as it is 1) set based and 2) cool.
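A sketch of those four steps in 8i-compatible SQL. All names are illustrative: STAGE, CHANGES, and DIMENSION are assumed to share the same shape, with key column ID and attribute columns ATTR1/ATTR2; step 1 (loading the file into STAGE, e.g. with SQL*Loader) is assumed already done.

```sql
-- step 2: keep only the rows that differ from the dimension
-- (MINUS compares NULLs as equal, so it is a null-safe difference test)
insert into changes
select * from stage
minus
select * from dimension;

-- step 3: update the rows that already exist in the dimension
update dimension d
   set (attr1, attr2) = (select c.attr1, c.attr2
                           from changes c
                          where c.id = d.id)
 where d.id in (select id from changes);

-- step 4: insert the rows that are genuinely new
insert into dimension
select *
  from changes c
 where not exists (select null
                     from dimension d
                    where d.id = c.id);
```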
For one of our 8i data warehouses we attempt to update all the DIMENSION records we have a changed value for, and use exception handling to process the inserts when we get a missing dimension value. Our code for this is row by row using a cursor, but it performs quite well.
February 16, 2004 - 9:12 am UTC
they made no mention whatsoever of other stuff.
If they come up with other constraints, I'd be glad to address them...
Type 2 Dimension
A reader, February 17, 2004 - 4:25 pm UTC
Thank you very much.
Rapidly changing monster dimensions
A reader, September 21, 2004 - 8:04 pm UTC
I don't want to get bogged down in semantics or buzzwords here about how Kimball defined things, but we have the following problem:
A million-row table containing a bunch of numbers, to which we wish to provide access for 30 days of history. Only about 5% of the data changes from day to day.
Without sacrificing query performance, how can I provide the logical equivalent of having a 30million row table but store only the delta from day to day?
[The textbook technique for handling such rapidly changing monster dimensions suggests creating another table with the attributes that change less frequently and creating "bands" from them. That's not an option here since all my numeric values are continuous and cannot be banded into discrete ranges.]
Help? Thanks
Is this approach right?
A reader, June 20, 2006 - 9:40 am UTC
Hi,
We have a set of pl/sql procedures which do the ETL process for our data warehouse. Is it a good design to have a table which lists all the available procedures, their active/inactive states, and their sequence of execution? I could then have a main package which calls the procedures in the sequence defined. This way any new procedures introduced at a later point can easily be plugged into the ETL process.
Is this approach correct or is there a better way of doing this? Please suggest.
Thanks in advance.
June 20, 2006 - 10:13 am UTC
depends on how frequently this is executed. dynamic sql is "neat" but generic is typically less efficient. but if this is infrequently executed - probably "ok" in this case.
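One way to sketch such a driver table and its main loop (all names here are made up, not an established pattern from the thread):

```sql
-- driver table listing the ETL procedures to run
create table etl_step (
  step_name  varchar2(30) primary key,   -- name of a stored procedure
  is_active  varchar2(1)  default 'Y' not null,
  run_order  number       not null
);

-- main driver: run the active steps in order via dynamic SQL.
-- step_name is assumed to come from a trusted, DBA-maintained table;
-- if that is not guaranteed, validate it (e.g. with dbms_assert)
-- before concatenating it into the statement.
begin
  for r in (select step_name
              from etl_step
             where is_active = 'Y'
             order by run_order)
  loop
    execute immediate 'begin ' || r.step_name || '; end;';
  end loop;
end;
/
```

As the follow-up notes, this generic dynamic-SQL dispatch trades some efficiency for flexibility, which is usually acceptable for an infrequently run batch ETL.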
Best approach to update a SCD Type 2 table ?
Philippe, September 14, 2006 - 2:46 am UTC
Hi Tom,
We have a data warehouse in an Oracle 10.2 database... and we need to update SCD Type 2 dimension tables.
The algorithm to update the table is:
FOR ALL ROWs in the SOURCE query LOOP
    IF a corresponding row exists in the DIM TABLE THEN
        IF data are different THEN
            1. UPDATE the "CURRENT" row of the DIM TABLE
               SET VALID_TO = NOW,
                   CURRENT = 'N'
            2. INSERT new version into DIM TABLE
        END IF
    ELSE
        INSERT new row into DIM TABLE
    END IF
END LOOP
I don't think the algorithm you suggest in your first follow up is appropriate to our situation.
Of course, implementing this row by row in PL/SQL is not the fastest approach.
I see two approaches :
- express this in a single MERGE sql statement (but I don't see how)
- use PL/SQL FETCH BULK COLLECT...
Which is the best approach ?
Is there another way to do this ?
NB : - at the first run, all rows will be NEW => inserted.
- for the next runs, only a few lines will have to be inserted and updated (less than 5% of the DIM table)
Thanks in advance for your advice !
Philippe
September 14, 2006 - 9:10 am UTC
or a single update statement to retire existing rows and a single insert to insert new rows.
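That two-statement approach might look like the following sketch. The table and column names (dim, src, nat_key, attr, valid_from/valid_to, current_flag) are all made up; src stands for the SOURCE query in the pseudocode above, and the DECODE predicate is the null-safe change test discussed earlier in the thread.

```sql
-- 1) retire the current rows whose attributes have changed
update dim d
   set d.valid_to     = sysdate,
       d.current_flag = 'N'
 where d.current_flag = 'Y'
   and exists (select null
                 from src s
                where s.nat_key = d.nat_key
                  and decode(s.attr, d.attr, 0, 1) = 1);

-- 2) insert new versions for changed keys and brand-new keys alike:
--    anything without a still-current, identical row gets inserted
insert into dim (nat_key, attr, valid_from, valid_to, current_flag)
select s.nat_key, s.attr, sysdate, null, 'Y'
  from src s
 where not exists (select null
                     from dim d
                    where d.nat_key = s.nat_key
                      and d.current_flag = 'Y'
                      and decode(s.attr, d.attr, 0, 1) = 0);
```

The ordering matters: because the update runs first and retires changed rows, the insert's NOT EXISTS test then admits both changed and new keys while skipping unchanged ones.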
Best approach to update a SCD Type 2 table ?
A reader, September 14, 2006 - 9:32 am UTC
If the "source query" is a complex view on a remote database, which of the approaches would be the best one?
In your suggested one, this view would have to be queried twice...
September 14, 2006 - 9:55 am UTC
insert into gtt
select complex query;
do update from gtt;
do insert from gtt;
is one approach.
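A sketch of that global-temporary-table approach (src_gtt, complex_remote_view, dim and the column names are all made up for illustration):

```sql
-- one-time setup: a GTT with the same shape as the source query
create global temporary table src_gtt
on commit delete rows
as
select * from complex_remote_view where 1 = 0;   -- structure only, no rows

-- each load: run the expensive remote query exactly once...
insert into src_gtt
select * from complex_remote_view;

-- ...then drive both statements off the local GTT
-- (the null-safe change-detection predicates from the earlier
--  follow-ups would be added to both WHERE clauses)
update dim d
   set d.valid_to = sysdate, d.current_flag = 'N'
 where d.current_flag = 'Y'
   and d.nat_key in (select nat_key from src_gtt);

insert into dim (nat_key, attr, valid_from, current_flag)
select nat_key, attr, sysdate, 'Y'
  from src_gtt;
```

With ON COMMIT DELETE ROWS the staged data disappears automatically when the load transaction commits, so there is no cleanup step.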
how to do this using merge
Reene, December 24, 2007 - 3:41 am UTC
drop table t1;
drop table t2;
create table t1 as select object_id,object_name,trunc(sysdate)-1 created,trunc(sysdate)-1 updated from all_objects where rownum < 5;
create table t2 as select * from t1 ;
exec dbms_stats.gather_table_stats(user,'T1');
exec dbms_stats.gather_table_stats(user,'T2');
insert into t2 values (-1,'TEST',trunc(sysdate),trunc(sysdate)) ;
Now I want to merge t2 into t1 using a merge statement.
But I have a slightly different requirement -
1. if an object_id exists in t2 (e.g. all rows except the one with object_id = -1), then I should update
the existing rows with updated = trunc(sysdate)
and then insert the same row (i.e. the row which I am going to update) with updated = to_date('01-01-2199','mm-dd-yyyy');
so the expected output data in t1 is
OBJECT_ID OBJECT_NAME CREATED UPDATED
---------- ------------------------------ --------------- ---------------
17 FILE$ 23-DEC-07 01-JAN-2199
51 BOOTSTRAP$ 23-DEC-07 01-JAN-2199
25 I_COBJ# 23-DEC-07 01-JAN-2199
44 I_CON2 23-DEC-07 01-JAN-2199
17 FILE$ 23-DEC-07 24-DEC-2007
51 BOOTSTRAP$ 23-DEC-07 24-DEC-2007
        25 I_COBJ#                        23-DEC-07       24-DEC-2007
44 I_CON2 23-DEC-07 24-DEC-2007
-1 TEST 24-DEC-07 24-DEC-2007
I know this MERGE statement will update or insert, but I need:
update and insert (if matched)
or
insert (if not matched)
merge into t1
using (select * from t2) t
on (t1.object_id = t.object_id)
when matched then
  update set t1.object_name = t.object_name
when not matched then
  insert (object_id, object_name, created, updated)
  values (t.object_id, t.object_name, t.created, t.updated)
/
appreciate your help.
December 24, 2007 - 9:17 am UTC
you would not be using merge to update using a record and insert that record.
that is not what merge does.
is there another way to do this
Reene, December 24, 2007 - 9:23 am UTC
Hi Tom
yes, MERGE doesn't work like this, but is there another, better way to do it using a single SQL statement, not pl/sql?
Thanks
December 24, 2007 - 9:28 am UTC
merge is the only thing that can insert OR update.
but it won't do both with the same source row.
insert inserts, update updates, delete deletes, merge merges - that is all...
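Given that constraint, the poster's requirement can still be met without PL/SQL by running plain UPDATE and INSERT statements in order, as one transaction. A sketch against the t1/t2 tables created earlier in the thread:

```sql
-- 1) stamp the rows that exist in both tables with today's date
update t1
   set updated = trunc(sysdate)
 where object_id in (select object_id from t2);

-- 2) for those same keys, add a new open-ended version
insert into t1 (object_id, object_name, created, updated)
select t2.object_id, t2.object_name, t2.created,
       to_date('01-01-2199', 'mm-dd-yyyy')
  from t2
 where t2.object_id in (select object_id from t1);

-- 3) add the brand-new rows (object_id = -1 in the example)
insert into t1
select *
  from t2
 where object_id not in (select object_id from t1);
```

Step 2 must run before step 3: once the open-ended versions exist, only genuinely new keys fail the NOT IN test. This is two inserts and an update rather than a single statement, but it is still entirely set based.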
you will update
and then insert
Zafar, May 22, 2009 - 4:03 am UTC
Hi Tom,
I am building ETL for an SCD Type 2 dimension with an approach similar to the one you stated (first an update to expire the existing rows in the dimension, then a merge; using MERGE instead of INSERT is appropriate in my particular situation).
The only concern I have is that these two operations might not act in unison: for example, if the update executes and the merge/insert fails, there will be data issues in the target table. Any suggestion on how to handle this? Also, what would be a suggested recovery mechanism in case an error occurs in one of the steps?
May 23, 2009 - 10:18 pm UTC
we call this "a transaction"
a series of SQL statements either entirely succeed or entirely fail.
so, update to expire
then merge to refresh data
THEN COMMIT <<<=== this makes both of the prior statements permanent
if error then ROLLBACK <<<=== this makes any work of the above two statements go away.
Truth be told, if you have to ask this, you have a lot of reading to do before you can safely touch the database - suggest you start with the concepts guide and read it from start to finish. Don't take that wrong, but transactions are sort of the core foundation upon which everything else is built - if you didn't know about them, you have some learning to do before you start programming production applications.
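The expire-then-merge transaction above can be sketched in PL/SQL as follows (dim, src and their columns are made up; the point is the commit/rollback placement, not the particular DML):

```sql
begin
  -- statement 1: expire the current rows whose source row changed
  update dim d
     set d.valid_to = sysdate, d.current_flag = 'N'
   where d.current_flag = 'Y'
     and d.nat_key in (select nat_key from src);

  -- statement 2: bring in the new versions
  merge into dim d
  using src s
  on (d.nat_key = s.nat_key and d.current_flag = 'Y')
  when matched then
    update set d.attr = s.attr
  when not matched then
    insert (nat_key, attr, valid_from, current_flag)
    values (s.nat_key, s.attr, sysdate, 'Y');

  commit;        -- makes BOTH statements permanent together
exception
  when others then
    rollback;    -- undoes BOTH statements together
    raise;       -- re-raise so the caller/scheduler sees the failure
end;
/
```

Because neither statement is committed until both succeed, a failure in the merge leaves the dimension exactly as it was before the update ran; that is the recovery mechanism.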
Dimensional Indexes
A reader, June 07, 2009 - 5:44 am UTC
Hi Tom,
Thanks for suggesting the approach of merging the delta with the original dimension.
We have a dimension which is partitioned by is_current_flag, where history (type 2) is maintained in one partition and the current, ACTIVE records are in another.
On a daily basis we receive 1.5 million rows as the delta, of which 20% are new and the rest are type 1 or type 2 changes.
The delta records have to be merged with 304 million records and appropriately sit in either the historical or the current partition.
For updates to type 2 we generate a new surrogate key. We have bitmap indexes on 8 columns for this dimension.
Currently we create another TEMP table out of the current dimension, apply the new inserts and merges as a transformation, and build all the bitmap indexes on the TEMP table.
The last step is to do an EXCHANGE PARTITION with the existing dimension. However, the time spent creating the indexes is about 4 hours, which is too much.
I was thinking of applying a MERGE statement to the existing dimension, as the delta set is small. Does an update of 1.5 million rows with bitmap indexes on 8 columns look like a feasible solution, or could it become worse?
What other solutions can you propose? I'd appreciate your response.
Dimensional indexes
A reader, June 09, 2009 - 11:40 pm UTC
Hi Tom,
Can you please answer the above question?
June 10, 2009 - 11:56 am UTC
didn't really have anything to say, so I didn't.
Ok, i'll say one thing:
... I was thinking of applying a MERGE statement to the existing dimension, as the delta set is small.
Does an update of 1.5 million rows with bitmap indexes on 8 columns look like a feasible solution, or
could it become worse?
...
in a word: benchmark.
if you do it in bulk, it can be very efficient. if you do it slow by slow, it'll kill you.
but in the end, you have to benchmark it.
Dimensional indexes
A reader, June 10, 2009 - 12:03 pm UTC
Thanks Tom,
MERGE, being a bulk operation and not a row-by-row approach, will be benchmarked.
One question though: will the indexes be updated in bulk after the operation is complete, or will index maintenance occur for every row that touches a data block?
Thanks
June 10, 2009 - 3:01 pm UTC
we attempt to manage bitmap updates in bulk during that single sql statement, yes.
SCD Type 2 implementation
Anamika, September 10, 2009 - 8:06 am UTC
Hi Tom,
I need to use the SCD Type 2 concept for a table that gets new records as well as modified records once in a while. To handle this, do I have to update the date field effective_thru for existing records and insert the new records as is?
Can you help me write the right code for this, showing how I can handle it?
September 14, 2009 - 11:12 am UTC
update t set effective_thru = sysdate
where key in (select key from new_inputs);
insert into t ... select from new_inputs;
A reader, September 10, 2009 - 8:09 am UTC
Sorry, I need it to be written in PL/SQL code
September 14, 2009 - 11:13 am UTC
ummm
NO
YOU
DON'T
there is no procedural code necessary, desired or otherwise appropriate here.
My Solution for implementing SCD-2 using PL/SQL
Ranjeet Srivastava, September 20, 2009 - 1:11 am UTC
My approach here is to help you create a re-usable, common component to achieve an SCD-2 implementation using 100% SQL and PL/SQL.
Step#1. First of all, bring the DELTA from your source (I can discuss the ETL strategy for incremental data loads in a separate session if you need/want it).
Step#2. Create a control table (Source_System_ID, Source_Table_Name, Source_Column_Name, SCD_Type).
Step#3. List all the source tables that need SCD-2 or SCD-3. This is a one-time task to make it a re-usable component; you may need to consult your business people to freeze this.
Step#4. Design your target data warehouse table with the following additional columns (Effective_From, Effective_To, Active_Flag).
Step#5. For every delta row, based on the DML type (Insert, Update, or Delete; a soft delete generally becomes an Update), check its SCD qualification against the control table (as created above).
Step#6. If it qualifies for SCD-2, first go to the target, update the Effective_To column to the current date, and change the Active_Flag to '0' ('0' means inactive, '1' means active).
Step#7. Insert the new version of the record with Effective_From = current date, Effective_To = either NULL or some distant future date like 31-12-9999, and Active_Flag = '1'.
For all of this you can use a stored procedure written in PL/SQL.
BULK COLLECT can help you achieve it faster (set the LIMIT based on your requirement).
If you don't require an incremental data load approach, you can follow this algorithm instead.
The algorithm to update a dimension table (with a small number of records, a few hundred) in SCD-2 is:
FOR ALL ROWs in the SOURCE query LOOP
    IF a corresponding row exists in the DIM TABLE THEN
        IF data are different THEN
            1. UPDATE the "CURRENT" row of the DIM TABLE
               SET VALID_TO = NOW,
                   CURRENT = 'N'
            2. INSERT new version into DIM TABLE
        END IF
    ELSE
        INSERT new row into DIM TABLE
    END IF
END LOOP
NB: You may use BULK COLLECT to achieve this faster in Oracle.
Dimensional Query
Bhushan, October 12, 2009 - 5:22 pm UTC
Dear Thomas,
Very good information as usual. I need your suggestion/advice on the following. I know the best way would be benchmarking, but any leads on other alternatives will be helpful.
1) Table Structure
create table BRS_TRAN_DTL
(
DEFAULTCODE VARCHAR2(10) not null,
TRAN_NO VARCHAR2(12) not null,
TRAN_CODE VARCHAR2(10) not null,
TRAN_DATE DATE not null,
SEQ NUMBER not null,
COMPANY VARCHAR2(10) not null,
UNIT VARCHAR2(10),
LOCATION VARCHAR2(10),
PARTY VARCHAR2(20),
PARTY_TEXT VARCHAR2(50),
INV_NO VARCHAR2(12),
INV_DATE DATE,
PRD_GROUP VARCHAR2(20) not null,
PRODUCT VARCHAR2(20) not null,
PRODUCT_DTL VARCHAR2(50) not null,
PRODUCT_DTL_TEXT VARCHAR2(1000),
PRICE NUMBER,
QTY NUMBER,
UM VARCHAR2(10),
AMOUNT_INV NUMBER,
BASE_CURRENCY VARCHAR2(10),
TRAN_CURRENCY VARCHAR2(10),
BASE_AMOUNT NUMBER,
TRAN_AMOUNT NUMBER,
ACCT_BALANCE NUMBER,
EXCH_RATE_TRAN_2_BASE NUMBER,
VAT_AMT NUMBER,
WAYBILL_NO VARCHAR2(10),
BATCH_NO VARCHAR2(10),
SETTLED_YN VARCHAR2(1),
COMENT VARCHAR2(120),
STORE VARCHAR2(10),
CREATED_BY VARCHAR2(30),
CREATED_DATE DATE,
SEALED_YN VARCHAR2(1),
DISCOUNT_PERCENT NUMBER,
TRAN_EFFECT VARCHAR2(1) not null,
MODIFIED_BY VARCHAR2(30),
MODIFIED_DATE DATE,
MAJOR_GROUP VARCHAR2(20),
STOCK_YN VARCHAR2(1),
DTL_REQ_YN VARCHAR2(1),
CSV_STRING VARCHAR2(2000) not null,
TOUNIT VARCHAR2(10),
FROMUNIT VARCHAR2(10),
TRAN_REF_NO VARCHAR2(12),
STN_QTY NUMBER,
CSV_SEQ NUMBER,
TRAN_EXCH_AMOUNT NUMBER,
BASE_EXCH_AMOUNT NUMBER,
BASE_PRICE NUMBER,
CSV_VALUE NUMBER,
STN_VALUE NUMBER,
CONS_STRING VARCHAR2(2000),
SUBGROUP VARCHAR2(20),
PERIOD DATE,
PU_DISCOUNT NUMBER,
EXPIRY_DATE DATE,
TRANSPORT_COST NUMBER,
EXP_STRING VARCHAR2(100),
CSV_STRING_BATCH VARCHAR2(2000),
CSV_VALUE_BATCH NUMBER
)
Number of rows > 75000
For some reason the developers thought creating more indexes was going to help, so they created more than 15 indexes on various columns. Let us ignore them for now.
To continue: if on a particular date we need to calculate the OPENING STOCK and current STOCK of various products, do you think this will be a good approach?
SELECT SUM(op_stk), SUM(curr_stk), SUM(period_stk),
       product,
       defaultcode
  FROM (SELECT CASE WHEN tran_date <= to_date('30-jun-2008','DD-MON-RRRR')
                    THEN decode(tran_effect,'R',qty,'S',-qty)
               END op_stk,
               CASE WHEN tran_date > to_date('30-jun-2008','DD-MON-RRRR')
                    THEN decode(tran_effect,'R',qty,'S',-qty)
               END period_stk,
               decode(tran_effect,'R',qty,'S',-qty) curr_stk,
               product,
               defaultcode
          FROM brs_tran_dtl)
 GROUP BY product,
          defaultcode
Any other way of getting this result will be helpful.
Kind Regards,
Bhushan
Above Post
Bhushan, November 05, 2009 - 9:26 am UTC
Thanks for ignoring; I will come back with more details :)