Skip to Main Content

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, Pradhan.

Asked: February 14, 2004 - 9:41 pm UTC

Answered by: Tom Kyte - Last updated: September 14, 2009 - 11:13 am UTC

Category: Database - Version: 8.1.7

Viewed 10K+ times! This question is

You Asked

Hello Tom,

We have data warehouse. On daily basis we are planning to receive one file 3Gb size and approximate 3 millions records. The problem the file from source contains same record daily, some of the accounts can be added and some of the accounts are changed. Before loading the file into the target table we have to check three conditions. Compare the record by record with data in target table, if the record is same, just ignore do not do anything,2) if the record has changed, load the updated record and at the same time load the old record in history table 3) insert the new records into target.

As source system is not able to send only changed and new records, it is very painful at our end to do checking of the existing records and filter out and then second part changed records will be inserted in the target table and the old will be inserted in history table and new records will go in target.

Right in testing phase its taking 1.5 hours for 7000 record, so we are not when the actual file 3 million records will come daily, how we will handle.

Any ideas, suggestions, along WITH EXAMPLE OR PROCEDURE will be greatly appreciated.

Thanking you in advance.

Pradhan

and we said...

this should take a couple of minutes at best....

Load the 3 million records into a "scratch table". Index it, etc.

first thing is to "move" into history, well that is:

insert into history
(
select *
from existing_table E, scratch_table S
where <join condition>
and ((E.f1 <> S.f1 or (E.f1 is null and S.f1 is not null)
or (E.f1 is not null and S.f1 is null))
or
(E.f2 <> S.f2 or (E.f2 is null and S.f2 is not null)
or (E.f2 is not null and S.f2 is null))
or
.......
)

Then, I would

a) drop table existing_table
b) rename table scratch_table to existing_table


Done. Sounds like you get a complete refresh right -- so rather then "update existing", just "replace the entire thing".

The easiest way to speed up something is to not do it at all. So, rather then try to update and "merge", just "replace it all".


If you need to capture DELETES as well -- just make the above JOIN an OUTER join.

and you rated our response

  (19 ratings)

Is this answer out of date? If it is, please let us know via a Review

Reviews

Slowly changing dimensions type 2

February 15, 2004 - 2:57 pm UTC

Reviewer: A reader

Pradhan,

You are metion "Type 2 Dimension", do you mean by that slowly changing dimensions type 2 (SCD 2)?

A short description of SCD2:
Using Type 2, a new record with the new attributes is added to the dimension table. Historical fact table rows continue to reference the old dimension key with the old roll-up attribute; going forward, the fact table rows will reference the new surrogate key with the new roll-up thereby perfectly partitioning history (from www.rkimball.com/html/designtips/2000/designtip15.html)

So with SCD 2 you would keep the entire history of your dimension in the dimension. Fact data from August 2000 will for ever point to dimension attributes which were valid in August 2000 and fact data from May 2003 will for ever point to dimension attributes valid in May 2003. It's possible that that attribute didn't change between August 2000 and May 2003 so it would be the same dimension record but in case something changed it would be two different dimension records.

So if you are using SCD 2 why would you move your old row to a history table? When a query is executed against your mart, how should the DB know when it has to look in the history table and when in the actuall dimension?

Regards
Maurice








alternative

February 16, 2004 - 4:31 am UTC

Reviewer: Ferenc Palyi from Hungary

(E.f1 <> S.f1 or (E.f1 is null and S.f1 is not null)
or (E.f1 is not null and S.f1 is null))

An alternative for the above statement:

decode(e.f1, s.f1, 0, 1) = 1

Tom Kyte

Followup  

February 16, 2004 - 7:38 am UTC

you'll find the original to be marginally faster -- i prefer the verbose approach here.

But if you can't replace the table

February 16, 2004 - 8:57 am UTC

Reviewer: Peter from UK

If you use materialized views and have logging over the dimension table you may not be able to replace the dimension table without triggering a large amount of MV changes.
You could consider using a set based check on the loaded table to ignore identical rows. As you state Oracle 8.1.7 you will not have access to MERGE (shame, as that alone is well worth the move to 9.2) so you will need to process the updates and inserts separately.
Steps to follow:
1) load the data file to a STAGE area
2) insert into a CHANGES table all the rows that are NOT identical to those in the DIMENSION table (use something like Select * from STAGE MINUS select * from DIMENSION).
3) update all rows that exist in both the DIMENSION and CHANGES tables
4) insert all the rows from the CHANGES table that do not exist in the DIMENSION.

For simplicity I have put the updates first then the inserts - you may find that for efficiency you should go insert then update. But if you had Oracle 9 then use Merge as it 1)set based 2)cool

For one of 8i data warehouses we attempt to update all the DIMENSION records we have a changes value for and use exception handling to process the inserts when the we get a missing dimension value. Our code to this is row by row using a cursor but performs quite well

Tom Kyte

Followup  

February 16, 2004 - 9:12 am UTC

they made no mention whatsoever of other stuff.

If they come up with other constraints, I'd be glad to address them...




Type 2 Dimension

February 17, 2004 - 4:25 pm UTC

Reviewer: A reader

Thanking you very much.

Rapidly changing monster dimensions

September 21, 2004 - 8:04 pm UTC

Reviewer: A reader

I dont want to get bogged down with semantics or buzzwords here with how Kimball defined things, but we have the following problem:

A million row table containing a bunch of numbers that we wish to be able to provide access to the data for 30 days. Only about 5% of the data changes from day to day

Without sacrificing query performance, how can I provide the logical equivalent of having a 30million row table but store only the delta from day to day?

[The textbook technique for handling such rapidly changing monster dimensions suggests to create another table with the attributes that change less frequently and create "bands" from them. Thats not an option here since all my numeric values are continuous values, cannot be banded into discrete ranges]

Help? Thanks

Tom Kyte

Followup  

September 22, 2004 - 7:22 am UTC

see

</code> http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:4529781014729 <code>

for a quick example and check out the application developers guide for workspace manager for further details.

Is this approach right?

June 20, 2006 - 9:40 am UTC

Reviewer: A reader from India

Hi,

We have a set of pl/sql procedures which do the ETL process for our data warehouse. Is it a good design to have a table which will list all the avaialable procedures, their active/inactive states and the sequence of execution. I can then have a main package which will call the procedures to be executed in the sequence defined. This way any new procedures introduced at a later point can be easily plugged into the ETL process.
Is this approach correct or is there a better way of doing this? Please suggest.

Thanks in advance.


Tom Kyte

Followup  

June 20, 2006 - 10:13 am UTC

depends on how frequently this is executed. dynamic sql is "neat" but generic is typically less efficient. but if this is infrequently executed - probably "ok" in this case.

Best approach to update a SCD Type 2 table ?

September 14, 2006 - 2:46 am UTC

Reviewer: Philippe from Brussels - Belgium

Hi Tom,

We have a data warehouse in an Oracle 10.2 database... and we need to update SCD Type 2 dimension tables.

The algorithm to update the table is :

FOR ALL ROWs in the SOURCE query LOOP
IF ( a corresponding row exists in the DIM TABLE
THEN
IF data are different THEN
1. UPDATE the "CURRENT" row of the DIM TABLE
SET VALID_TO = NOW
CURRENT = 'N'
2. INSERT new version into DIM TABLE
END IF
ELSE
INSERT new row into DIM TABLE
END IF

I don't think the algorithm you suggest in your first follow up is appropriate to our situation.

Of course, implementing this row by row in PL/SQL is not the fastest approach.

I see two approaches :
- express this in a single MERGE sql statement (but I don't see how)
- use PL/SQL FETCH BULK COLLECT...

Which is the best approach ?
Is there another way to do this ?

NB : - at the first run, all rows will be NEW => inserted.
- for the next runs, only a few lines will have to be inserted and updated (less than 5% of the DIM table)

Thanks in advance for your advice !

Philippe

Tom Kyte

Followup  

September 14, 2006 - 9:10 am UTC

or a single update statement to retire existing rows and a single insert to insert new rows.

Best approach to update a SCD Type 2 table ?

September 14, 2006 - 9:32 am UTC

Reviewer: A reader

If the "source query" is a complex view on a remote database... which of the approaches would be the best one.

In your suggested one, this view would have to be queried twice ...


Tom Kyte

Followup  

September 14, 2006 - 9:55 am UTC

insert into gtt
select complex query;


do update from gtt;
do insert from gtt;


is one approach.

how to do this using merge

December 24, 2007 - 3:41 am UTC

Reviewer: Reene

drop table t1;
drop table t2;
create table t1 as select object_id,object_name,trunc(sysdate)-1 created,trunc(sysdate)-1 updated from all_objects where rownum < 5;
create table t2 as select * from t1 ;
exec dbms_stats.gather_table_stats(user,'T1');
exec dbms_stats.gather_table_stats(user,'T2');


insert into t2 values (-1,'TEST',trunc(sysdate),trunc(sysdate)) ;


Now I want to merge t2 into t1 using a merge statement.

But I have a slightly different requirement -

1. if an object_id exists in t2 (e.g all rows except the one with object_i = -1) then i should update

the existing rows as updated = trunc(sysdate)

and then insert the same row(i.e the row which I am going to update) as updateed = to_date('01-01-2199','mm-dd-yyyy') ;

so the expected ouput data in t1 is

OBJECT_ID OBJECT_NAME CREATED UPDATED
---------- ------------------------------ --------------- ---------------
17 FILE$ 23-DEC-07 01-JAN-2199
51 BOOTSTRAP$ 23-DEC-07 01-JAN-2199
25 I_COBJ# 23-DEC-07 01-JAN-2199
44 I_CON2 23-DEC-07 01-JAN-2199
17 FILE$ 23-DEC-07 24-DEC-2007
51 BOOTSTRAP$ 23-DEC-07 24-DEC-2007
25 I_COBJ# 23-DEC-07 23-DEC-2007
44 I_CON2 23-DEC-07 24-DEC-2007
-1 TEST 24-DEC-07 24-DEC-2007

I know this merge statement will update or insert but I need

update and innsert (if matched)
or
insert (if not matched)
merge into t1
using (select * from t2) t
on (t1.object_id = t.object_id)
when matched then
update set t1.object_name=t.object_name
when not matched then
insert values (t.object_id,t.object_name)
/

appreciate your help.
Tom Kyte

Followup  

December 24, 2007 - 9:17 am UTC

you would not be using merge to update using a record and insert that record.

that is not what merge does.

is there another way to do this

December 24, 2007 - 9:23 am UTC

Reviewer: Reene

Hi Tom

yes merge don;t work like this,but is it another better way to do that using a single SQL not pl/sql.


Thanks
Tom Kyte

Followup  

December 24, 2007 - 9:28 am UTC

merge is the only thing that can insert OR update.

but it won't do both with the same source row.

insert inserts, update updates, delete deletes, merge merges - that is all...

you will update
and then insert


May 22, 2009 - 4:03 am UTC

Reviewer: Zafar from India

Hi Tom,

I am building ETL for a SCD Type 2 dimension with a similar approach that you stated (First update to expire the existing rows in Dimension and then Merge (using merge, instead of Insert, is appropriate with my particular situation).
The only concern that I have is that if these two operations are not in unison i.e. for e.g. if the update executes and merge/insert fails, their will be data issues in the target table. Any suggesstion on how to handle this? Also, what can be a suggested recovery mechanism in case an error occurs in execution in one of the steps?
Tom Kyte

Followup  

May 23, 2009 - 10:18 pm UTC

we call this "a transaction"

a series of SQL statements either entirely succeed or entirely fail.

so, update to expire
then merge to refresh data
THEN COMMIT <<<=== this makes both of the prior statements permanent

if error then ROLLBACK <<<=== this makes any work of the above two statements go away.


Truth be told, if you have to ask this, you have a lot of reading to do before you can safely touch the database - suggest you start with the concepts guide and read it from star to finish. Don't take that wrong, but transactions are sort of the core foundation upon which everything else is built - if you didn't know about them, you have some learning to do before you start programming production applications.

Dimensional Indexes

June 07, 2009 - 5:44 am UTC

Reviewer: A reader from India

Hi Tom,

Thanks for suggesting the approach of merging the delta with the origional dimension.

We have a dimension which is partitioned by is_current_flag where history (type 2) is maintained in a separate partition and the current records are in a separate partition which are the ACTIVE records.

On a daily basis we recieve 1.5 million rows as the delta out of which 20% are new and rest are type 1 or type 2 changes.

The delta records have to merge with 304 million records and appropriately sit in either historical or current partition.

For updates to type 2 we generate a new surrogate key. We have bitmaps on 8 columns for this dimension.

Curently we are creating another TEMP table out of the current dimension. Apply the new inserts and merge as transformation. Build all bitmaps indexes on the TEMP table.

The last step is do EXCHANGE PARTITION with the existing dimension.However the time spent to create the indexes consumes about 4 hrs which is too much.

I was thinking to apply a merge statement to the existing dimension as the delta set is too small. Should update to 1.5 million rows with BITMAP INDEXES on 8 columns looks feasible solution or it can become worse.

What other solutions you can propose, appreciate your response.

Dimensional indexes

June 09, 2009 - 11:40 pm UTC

Reviewer: A reader from india

Hi Tom,

Can you please answer to the above question.
Tom Kyte

Followup  

June 10, 2009 - 11:56 am UTC

didn't really have anything to say, so I didn't.

Ok, i'll say one thing:

... I was thinking to apply a merge statement to the existing dimension as the delta set is too small.
Should update to 1.5 million rows with BITMAP INDEXES on 8 columns looks feasible solution or it
can become worse.
...

in a word: benchmark.


if you do it in bulk, it can be very efficient. if you do it slow by slow, it'll kill you.

but in the end, you have to benchmark it.

Dimensional indexes

June 10, 2009 - 12:03 pm UTC

Reviewer: A reader from india

Thanks Tom,

Merge itself being a bulk operation and not row by row approach will be benchmarked.

Though one question will the indexes be updated in bulk after the operation is complete or the index maintenance will be for every row that touches the data block.

Thanks
Tom Kyte

Followup  

June 10, 2009 - 3:01 pm UTC

we attempt to manage bitmap updates in bulk during that single sql statement, yes.

SCD Type 2 implementation

September 10, 2009 - 8:06 am UTC

Reviewer: Anamika from India

Hi Tom,

I'm need to using the SCD Type2 concept regarding a table that gets new records as well as modified records once in a while, so inorder to handle this like i have to update the date feild, effective_thru for existing records and insert the new records as it is?

Can you help me in writing the right code for this as to how can i handle this?


Tom Kyte

Followup  

September 14, 2009 - 11:12 am UTC

update t set effective_thru = sysdate
where key in (select key from new_inputs);

insert into t ... select from new_inputs;


September 10, 2009 - 8:09 am UTC

Reviewer: A reader from India

Sorry i need to be written in PL/SQL code
Tom Kyte

Followup  

September 14, 2009 - 11:13 am UTC

ummm

NO
YOU
DON'T

there is no procedural code necessary, desired or otherwise appropriate here.

My Solution for implementing SCD-2 using PL/SQL

September 20, 2009 - 1:11 am UTC

Reviewer: Ranjeet Srivastava from India

My approach here will be to help you create a Re-usable/ Common-Component to achieve SCD-2 implementation using 100% SQL and PL/SQL.
Step#1. First of all bring the DELTA from your source (I can discuss the ETL Strategy for Incremental Data load in separate session if you need/want).
Step#2. Create a control_table(Source_Sytem_ID, Source_Table_Name, Source_Column_Name, SCD_Type).
Step#3. List down all the Source Tables that needs SCD-2 or SCD-3, this is one time task to make it re-usable component, here you may need to consult you business people to freez this.
Step#4. Design your Target Data Warehouse Table along with following additional Columns (Effective_From, Effective_To, Active_Flag).
Step#5. For every Delta, based on the DML type (whether it is Insert or Update or Delete-generally soft delete becomes Update), check for its SCD qualification into the Control_Table (as created above).
Step#6. If it qualifies for SCD-2 then first go to the target update the Effective_To column to the current date and change the Active_Flag to '0' zero ('0' means inactive, '1' means active).
Step#7. Insert the new updated record with Effective_From=Current Date and Effective_To='either null or some future distant date like 31-12-9999'. Simultaniously, update the Active_Flag to '1'.

For all these purpose, you can use Stored Procedure using PL/SQL.
Bulk Collect can help you achive it faster (Set the Limit, based on your requirement)

If you don't require Incremental Data Load approach, then you can follow following algorithm:==>

The algorithm to update the dimension table (having less no of records-few hundreds) in SCD-2 will be:

FOR ALL ROWs in the SOURCE query LOOP
  IF ( a corresponding row exists in the DIM TABLE 
  THEN 
    IF data are different THEN
      1. UPDATE the "CURRENT" row of the DIM TABLE 
           SET VALID_TO = NOW
               CURRENT = 'N'
      2. INSERT new version into DIM TABLE 
    END IF
  ELSE
    INSERT new row into DIM TABLE 
  END IF


NB: You may use buld collect to achieve this faster.....in Oracle.

Dimensional Query

October 12, 2009 - 5:22 pm UTC

Reviewer: Bhushan from Lagos, Nigeria

Dear Thomas,
Very good information as usual.I need your suggestion/advice on the following.I know the best way would be benchmarking but then any leads on other alternatives will be helpful.
1) Table Structure
create table BRS_TRAN_DTL
(
DEFAULTCODE VARCHAR2(10) not null,
TRAN_NO VARCHAR2(12) not null,
TRAN_CODE VARCHAR2(10) not null,
TRAN_DATE DATE not null,
SEQ NUMBER not null,
COMPANY VARCHAR2(10) not null,
UNIT VARCHAR2(10),
LOCATION VARCHAR2(10),
PARTY VARCHAR2(20),
PARTY_TEXT VARCHAR2(50),
INV_NO VARCHAR2(12),
INV_DATE DATE,
PRD_GROUP VARCHAR2(20) not null,
PRODUCT VARCHAR2(20) not null,
PRODUCT_DTL VARCHAR2(50) not null,
PRODUCT_DTL_TEXT VARCHAR2(1000),
PRICE NUMBER,
QTY NUMBER,
UM VARCHAR2(10),
AMOUNT_INV NUMBER,
BASE_CURRENCY VARCHAR2(10),
TRAN_CURRENCY VARCHAR2(10),
BASE_AMOUNT NUMBER,
TRAN_AMOUNT NUMBER,
ACCT_BALANCE NUMBER,
EXCH_RATE_TRAN_2_BASE NUMBER,
VAT_AMT NUMBER,
WAYBILL_NO VARCHAR2(10),
BATCH_NO VARCHAR2(10),
SETTLED_YN VARCHAR2(1),
COMENT VARCHAR2(120),
STORE VARCHAR2(10),
CREATED_BY VARCHAR2(30),
CREATED_DATE DATE,
SEALED_YN VARCHAR2(1),
DISCOUNT_PERCENT NUMBER,
TRAN_EFFECT VARCHAR2(1) not null,
MODIFIED_BY VARCHAR2(30),
MODIFIED_DATE DATE,
MAJOR_GROUP VARCHAR2(20),
STOCK_YN VARCHAR2(1),
DTL_REQ_YN VARCHAR2(1),
CSV_STRING VARCHAR2(2000) not null,
TOUNIT VARCHAR2(10),
FROMUNIT VARCHAR2(10),
TRAN_REF_NO VARCHAR2(12),
STN_QTY NUMBER,
CSV_SEQ NUMBER,
TRAN_EXCH_AMOUNT NUMBER,
BASE_EXCH_AMOUNT NUMBER,
BASE_PRICE NUMBER,
CSV_VALUE NUMBER,
STN_VALUE NUMBER,
CONS_STRING VARCHAR2(2000),
SUBGROUP VARCHAR2(20),
PERIOD DATE,
PU_DISCOUNT NUMBER,
EXPIRY_DATE DATE,
TRANSPORT_COST NUMBER,
EXP_STRING VARCHAR2(100),
CSV_STRING_BATCH VARCHAR2(2000),
CSV_VALUE_BATCH NUMBER
)
Number of rows > 75000
For some reasons the developers thought creating more indexes is going to help so they ceated more than 15 indexes on various columns. Let us ignore them for now.
To continue further If on a particular date we need to calculate OPENING STOCK and current STOCK of various products do you think this will be a good approach

SELECT SUM (op_stk),SUM(curr_stk),SUM(period_stk),
product,
defaultcode
FROM(
SELECT
CASE WHEN to_date(tran_date,'DD-MON-RRRR') <= to_date('30-jun-2008','DD-MON-RRRR') THEN
decode(tran_effect,'R',qty,'S',-qty) END op_stk,
CASE WHEN to_date(tran_date,'DD-MON-RRRR') > to_date)'30-jun-2008','DD-MON-RRRR') THEN
decode(tran_effect,'R',qty,'S',-qty) END period_stk,

decode(tran_effect,'R',qty,'S',-qty) curr_stk,
product,
defaultcode

FROM brs_tran_dtl )
GROUP BY product,
defaultcode
Any other way of getting this result will be helpful.

Kind Regards,
Bhushan

Above Post

November 05, 2009 - 9:26 am UTC

Reviewer: Bhushan from Lagos, NIgeria

Thanks for Ignoring i will come back with more details :)