Merge Insert 50 million records by finding only new records

Question and Answer

Tom Kyte

Thanks for the question, shaheem.

Asked: April 20, 2014 - 10:20 am UTC

Last updated: April 23, 2014 - 9:02 pm UTC

Version: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production

Viewed 10K+ times

You Asked

Hi Tom

I have a full dump of roughly 50 million records arriving every week.
Every week there is, let's say, a 5% increase in the full dump.


My task is to load this set into a STG (staging) table as-is, then move it into a DIMension (prod) table with a new surrogate key.

STG table
--------------
Col 1
col 2


DIM Table
----------
Col 1
col 2
SEQ_SK (Unique Surrogate Key)


The vendor is not sending incremental data, but always a full dump, and that is the concern.
Because of this I need to identify which records are new and then perform an insert every week.

Currently I have written a MERGE / INSERT statement which I'm sure is not ideal.

Could you suggest a better approach to this issue?

-----------------------------------------------------------------------

FIRST WEEK - Full dump arrives (50 million records)
step 1: load 50 million records into the STG table
step 2: insert 50 million records into the DIMension table (with a new key generated in a new column DIM_SK in the DIM table)

SECOND WEEK - Full dump arrives (this time 53 million records; only 3 million rows are new)
step 1: load 53 million records into STG

step 2: ? --> here is the question. Here I need to pick only those 3 million rows and insert them, with new keys, into the DIM table (which holds the 50 million records from the previous week with their previously generated keys) without affecting the existing records. We cannot delete the old DIM data in any case, as last week's DIM_SK values are in use and already tagged to FACT tables.



Thanks.








and Tom said...

Use a direct path load


something like this:


ops$tkyte%ORA11GR2> create sequence s;

Sequence created.

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> create table stage
  2  as
  3  select *
  4    from all_objects
  5   where mod(object_id,2) = 0
  6  /

Table created.

ops$tkyte%ORA11GR2> create table t
  2  as
  3  select 0 pk, a.*
  4    from all_objects a
  5   where 1=0
  6  /

Table created.

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> insert /*+ append */
  2  into t ( pk, OWNER, OBJECT_NAME, SUBOBJECT_NAME, OBJECT_ID, DATA_OBJECT_ID,
  3           OBJECT_TYPE, CREATED, LAST_DDL_TIME, TIMESTAMP, STATUS,
  4           TEMPORARY, GENERATED, SECONDARY, NAMESPACE, EDITION_NAME)
  5  select s.nextval, x.*
  6    from ( select OWNER, OBJECT_NAME, SUBOBJECT_NAME, OBJECT_ID, DATA_OBJECT_ID,
  7                  OBJECT_TYPE, CREATED, LAST_DDL_TIME, TIMESTAMP, STATUS,
  8                  TEMPORARY, GENERATED, SECONDARY, NAMESPACE, EDITION_NAME
  9             from STAGE
 10                    MINUS
 11                   select OWNER, OBJECT_NAME, SUBOBJECT_NAME, OBJECT_ID, DATA_OBJECT_ID,
 12                  OBJECT_TYPE, CREATED, LAST_DDL_TIME, TIMESTAMP, STATUS,
 13                  TEMPORARY, GENERATED, SECONDARY, NAMESPACE, EDITION_NAME
 14              from t ) x
 15  /

36649 rows created.

ops$tkyte%ORA11GR2> commit;

Commit complete.

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> drop table stage;

Table dropped.

ops$tkyte%ORA11GR2> create table stage as select * from all_objects;

Table created.

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> insert /*+ append */
  2  into t ( pk, OWNER, OBJECT_NAME, SUBOBJECT_NAME, OBJECT_ID, DATA_OBJECT_ID,
  3           OBJECT_TYPE, CREATED, LAST_DDL_TIME, TIMESTAMP, STATUS,
  4           TEMPORARY, GENERATED, SECONDARY, NAMESPACE, EDITION_NAME)
  5  select s.nextval, x.*
  6    from ( select OWNER, OBJECT_NAME, SUBOBJECT_NAME, OBJECT_ID, DATA_OBJECT_ID,
  7                  OBJECT_TYPE, CREATED, LAST_DDL_TIME, TIMESTAMP, STATUS,
  8                  TEMPORARY, GENERATED, SECONDARY, NAMESPACE, EDITION_NAME
  9             from STAGE
 10                    MINUS
 11                   select OWNER, OBJECT_NAME, SUBOBJECT_NAME, OBJECT_ID, DATA_OBJECT_ID,
 12                  OBJECT_TYPE, CREATED, LAST_DDL_TIME, TIMESTAMP, STATUS,
 13                  TEMPORARY, GENERATED, SECONDARY, NAMESPACE, EDITION_NAME
 14              from t ) x
 15  /

36555 rows created.

ops$tkyte%ORA11GR2> commit;

Commit complete.



The nice thing about MINUS is that it deals with NULLs correctly. I'm assuming you want to match on the entire record here. If not, MERGE /*+ APPEND */ will work nicely as well - perhaps better than MINUS.
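
For reference, a minimal sketch of that MERGE alternative against the same STAGE and T tables might look like this, assuming OBJECT_ID can serve as the matching key (an assumption; the MINUS version needs no such key):

-- Sketch of the MERGE /*+ append */ alternative, assuming OBJECT_ID is a
-- reliable matching key. Note the sequence may increment for matched rows
-- too on some versions, which is harmless (gaps only).
merge /*+ append */ into t
using stage stg
on (t.object_id = stg.object_id)
when not matched then insert
     ( pk, owner, object_name, subobject_name, object_id, data_object_id,
       object_type, created, last_ddl_time, timestamp, status,
       temporary, generated, secondary, namespace, edition_name )
values
     ( s.nextval, stg.owner, stg.object_name, stg.subobject_name,
       stg.object_id, stg.data_object_id, stg.object_type, stg.created,
       stg.last_ddl_time, stg.timestamp, stg.status, stg.temporary,
       stg.generated, stg.secondary, stg.namespace, stg.edition_name );
commit;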


That will direct path load the data - bypassing UNDO, *optionally* bypassing redo (make sure you understand the ramifications of this!!), and maintaining any existing indexes as efficiently as possible. It will also bypass the inefficiencies of the buffer cache and write directly to the datafiles.

You may want to enable parallel query whilst doing this if you have the free cpu, memory and IO bandwidth to do so.
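
A hypothetical sketch of what that might look like (the degree of 8 is purely illustrative, not a recommendation):

-- Illustrative only: parallel direct-path version of the insert above.
alter session enable parallel dml;

insert /*+ append parallel(t 8) */
into t ( pk, owner, object_name, subobject_name, object_id, data_object_id,
         object_type, created, last_ddl_time, timestamp, status,
         temporary, generated, secondary, namespace, edition_name )
select s.nextval, x.*
  from ( select /*+ parallel(stage 8) */
                owner, object_name, subobject_name, object_id, data_object_id,
                object_type, created, last_ddl_time, timestamp, status,
                temporary, generated, secondary, namespace, edition_name
           from stage
          minus
         select /*+ parallel(t 8) */
                owner, object_name, subobject_name, object_id, data_object_id,
                object_type, created, last_ddl_time, timestamp, status,
                temporary, generated, secondary, namespace, edition_name
           from t ) x;

commit;  -- a direct-path insert must commit before t can be re-queried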


Comments

shaheem cherinkuzhiyil, April 24, 2014 - 8:35 am UTC

Thanks Tom. Appreciate the prompt reply.

Also, I forgot to point out that the tables actually contain around 186 columns, and we also intend to perform a merge update, as any of the 180 attributes may change in the weekly dump.

I mean that every week the full dump will have dimensional records with changes in any of their 180 attribute columns, or records that are totally new.

When the MERGE command does the comparison it will obviously touch all the rows plus the columns, and as such I'm concerned about the time it may take to complete the MERGE update/insert.

Is the MERGE approach the optimal solution?


Thanks

@shaheem

Stew Ashton, April 24, 2014 - 9:47 am UTC


Shaheem, does your input data have more than one row with exactly the same data? If so, MINUS will remove duplicates and only one of the rows will remain. Is this what you want?

I'm curious as to how you would do a "merge update". To do a MERGE, you have to compare the rows in the two tables. How do you know which rows to compare? You would need the same unique identifier in both tables. Does the input data already have a unique identifier?

Shah., April 24, 2014 - 12:24 pm UTC

Hi Stew,
By the way, there are no duplicates. We have a column that is unique when it arrives in the full dump. We then create another surrogate key when we merge this STG data into our dimension.
The full dump comes with a unique Member_ID value plus 180 attributes of that member.
Hence the STG_Member_Master table has a Member_ID column + 180 attribute columns.

Member_Master_DIM will have Member_ID + 180 cols + a new Member_SK (a surrogate key which we generate).

Every week the full dump can have old + new records.
Old records may or may not have changes to their 180 columns. Hence a merge update is used to match the Member_ID and update all columns whether or not any of the 180 columns have changed. (Should we use MINUS here, considering there are 50 million records + 180 columns to MINUS over?)

If the Member_ID is not matched, a merge insert will create a fresh record in the DIM table.

Qn: Should we use a MINUS between the STG and DIM tables, each having 50-ish million records, within the MERGE UPDATE/INSERT statement,

OR

should we use a merge update/insert and update all columns of all existing records?


Thanks,
Shah.



180 columns

George Joseph, April 25, 2014 - 4:41 am UTC

Hi Shah

About the 180 attributes from the source file: is there a possibility that at least one of the columns is a record-keeping field which contains, or from which one can derive, a last_updated_date or created_date?

If so, it could be worth checking whether you can filter early, from the stage table itself, using last_updated_date > the maximum date already in the prod table (i.e. bring only those 3 million rows forward instead of 53 million) and then do the MINUS.
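
For illustration, a hypothetical sketch of that early filter (the external table name and the last_updated_date column are assumptions, not something the thread confirms):

-- Hypothetical sketch: load only rows newer than anything already in the
-- dimension. Assumes the feed carries a trustworthy last_updated_date and
-- that ext_member_master_full is an external table over the weekly file.
insert /*+ append */ into stg_member_master
select *
  from ext_member_master_full
 where last_updated_date > ( select max(last_updated_date)
                               from member_master_dim );
commit;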

Cheers
George

@OP

A reader, April 25, 2014 - 2:05 pm UTC

You can do something like the following:

select member_id, <your_180_columns>
from
(
 select 1 as flg, member_id, <your_180_columns> from Member_Master_DIM
 UNION ALL
 select 2 as flg, member_id, <your_180_columns> from stage_tbl
) x
group by member_id, <your_180_columns>
having count(*) = 1  -- count(*) > 1 means the record is unchanged; count(*) = 1 means it is either new or changed
   and max(flg) = 2  -- pick the record from the stage side, since you need the recent values whether the record re-appears changed or is brand new

You can then use the above set to merge into your dimension table.

For example, consider t1 as your dimension table and t2 as your stage table (which has the weekly full-dump data):

with t1 as 
(
 select 1 as id, 'A' col1, 100 as col2 from dual union all
 select 2 as id, 'B' col1, 12 as col2 from dual union all
 select 3 as id, 'C' col1, 455 as col2 from dual 
),
t2 as 
(
 select 1 as id, 'A' col1, 120 as col2 from dual union all
 select 2 as id, 'X' col1, 12 as col2 from dual union all
 select 3 as id, 'C' col1, 455 as col2 from dual union all
 select 4 as id, 'W' col1, 45 as col2 from dual 
)
select max(flg), id, col1, col2, count(*)
from
(
 select 1 as flg, d.* from t1 d
 UNION ALL
 select 2, stg.* from t2 stg
)x 
group by id, col1, col2
having count(*)=1 and max(flg)=2 


merge into t1 trg
using
(
 select max(flg), id, col1, col2
 from
 (
  select 1 as flg, d.* from t1 d
  UNION ALL
  select 2, stg.* from t2 stg
 )x 
 group by id, col1, col2
 having count(*)=1 and max(flg)=2 
)src
on trg.id=src.id 
when matched then update ...
when not matched then insert ...



About the Last Approach

Andre Guerreiro, May 17, 2014 - 12:29 am UTC

Hi.

About the last approach, using UNION ALL and then grouping and counting duplicates: I would strongly advise against that unless you have a dedicated server that can handle it. Even so, I would only use that technique with less data involved.

As to the advice on using MINUS, I'm very surprised by the results, and that will hopefully help me a lot in some specific cases. Always a joy to read Tom's posts.

OS level diff

Max, June 13, 2014 - 7:20 am UTC

Use an OS utility (e.g. comm) on last week's and this week's extracts to pull out the changes, i.e. new and changed records. This should give a smaller set (3 million new rows plus the updates) to load into staging and then merge with the dimension.

union all + group by vs minus

Rusty, June 16, 2014 - 2:05 pm UTC

UNION ALL + GROUP BY is the fastest way to identify deltas (removed, added, and updated records). It has much better performance than MINUS. It also lets you select, via aggregate functions, columns that are kept out of the comparison. For example, when a record is updated we may need to propagate some value from the old version to the new one.
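
To make that concrete, here is a hedged sketch reusing the earlier t1/t2 example, extended with an assumed surrogate key sk on the dimension; the key is kept out of the comparison and carried over from the old row via an aggregate (the '~' delimiter is assumed not to occur in the data):

-- Sketch only: t1 is the dimension (with assumed surrogate key sk),
-- t2 is the stage. Group by the business key alone so the old row's
-- sk can be propagated; detect changes by counting distinct attribute
-- combinations.
select id,
       max(case when flg = 2 then col1 end) as col1,  -- new value from stage
       max(case when flg = 2 then col2 end) as col2,  -- new value from stage
       max(case when flg = 1 then sk   end) as sk     -- old key; null for new ids
  from ( select 1 as flg, id, col1, col2, sk from t1
         union all
         select 2 as flg, id, col1, col2, null from t2 )
 group by id
having max(flg) = 2                                   -- id is present in the feed
   and (   count(*) = 1                               -- brand new id
        or count(distinct col1 || '~' || col2) > 1 ); -- attributes changed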

@shaheem

Gabe, June 23, 2014 - 8:18 pm UTC

"Also, I forgot to point out that ... we also intend to perform a merge update, as any of the 180 attributes may change in the weekly dump."

"Old records may or may not have changes to their 180 columns. Hence a merge update is used to match the Member_ID and update all columns whether or not the 180 columns have any change."

That makes it a different problem from what you described initially, so the solution given won't be applicable.

It seems you want to insert records with new MEMBER_IDs and update records with existing MEMBER_IDs when any of the other attributes were modified.

David Aldridge, October 14, 2014 - 7:29 am UTC

If you're loading data into the staging table and not doing anything useful with it in the process, then you're missing a big opportunity. You should try to do as much as possible to the data every time you touch it, and at a minimum I would calculate a hash of the attribute columns for comparison with a hash of the values in the dimension table.

Thus you do not need to compare a great many columns -- you compare the key values to check for new ones, and the hash values to check for changed attributes.

Of course you need to store these hash values in the dimension table as well.
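
As a hedged illustration of the idea (not David's exact method): ORA_HASH is available in 11g, though as a 32-bit hash it can collide, so treat it as a filter rather than proof of equality; DBMS_CRYPTO.HASH would be a stronger, costlier choice. All table and column names below are assumptions.

-- Sketch: compute a row hash over the attributes at load time, then
-- compare one number per row instead of 180 columns. The '~' delimiter
-- is assumed absent from the data.
insert /*+ append */ into stg_member_master (member_id, att1, att2, row_hash)
select member_id, att1, att2,
       ora_hash(att1 || '~' || att2)  -- extend with the remaining attributes
  from ext_member_master_full;        -- hypothetical external table

select s.member_id                    -- changed members
  from stg_member_master s
  join member_master_dim  d on d.member_id = s.member_id
 where d.row_hash <> s.row_hash;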

You can then go a step further and compare the new data and its hash to the existing dimension data during the import, discarding rows that have not changed and leaving yourself with only the modified and new records.

As a further step, you can use a multi table insert to send new records to one staging table and changed records to another, allowing a direct path insert of the new records and an update of the changed.
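
A sketch of that split (hypothetical table names; row_hash as in the previous sketch):

-- Sketch: one pass over the stage data routes brand-new members to
-- stg_new and changed members to stg_changed (hypothetical tables).
insert first
  when old_hash is null then
       into stg_new     (member_id, att1, att2) values (member_id, att1, att2)
  when old_hash <> row_hash then
       into stg_changed (member_id, att1, att2) values (member_id, att1, att2)
select s.member_id, s.att1, s.att2, s.row_hash, d.row_hash as old_hash
  from stg_member_master s
  left join member_master_dim d on d.member_id = s.member_id;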

Ultimately you may find that you can abandon the staging table entirely and perform a merge of the data directly into the dimension table.

Another approach would be to direct-path insert the data in its transformed state into a new dimension table, reading the appropriate existing dimension key values and generating new ones as needed, and then either partition-exchange the new table with the original or use a synonym to switch between them.
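
For instance, a sketch of the synonym-switch variant, under the assumption that the application already reads the dimension through a synonym rather than the base table (every name here is illustrative):

-- Hypothetical sketch: rebuild the dimension as a new table, then switch
-- the synonym. member_seq stands in for the real surrogate-key sequence.

-- 1) Carry over existing members: keep their surrogate keys, take the
--    (possibly changed) attribute values from stage.
create table member_dim_b as
select d.member_sk, s.*
  from stg_member_master s
  join member_dim_a d on d.member_id = s.member_id;

-- 2) Direct-path insert brand-new members with freshly minted keys.
insert /*+ append */ into member_dim_b
select member_seq.nextval, s.*
  from stg_member_master s
 where not exists ( select null
                      from member_dim_a d
                     where d.member_id = s.member_id );
commit;

-- 3) Repoint the synonym; next week rebuild the other table and switch back.
create or replace synonym member_master_dim for member_dim_b;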

@Shah on how to MERGE

Stew Ashton, November 30, 2014 - 5:38 pm UTC


Shah,

You may never see this page again, but your final question was:
"Should we use a Merge update/Insert and update all columns of all existing records."

I say yes and no: you should use MERGE and update all columns, but not of all existing records.

Use the WHERE clause within the UPDATE clause of the MERGE statement:
create table t_target as
select level member_id, 'att1' att1, cast(null as varchar2(9)) att2
from dual
connect by level <= 10;

create table t_source as
select * from t_target;

update t_source set att2 = 'new1' where member_id = 1;
insert into t_source values(11,'ins','ert');
commit;

merge /*+ append */ into t_target t
using t_source s
on (t.member_id = s.member_id)
when matched then update
  set att1 = s.att1, att2 = s.att2
  where 1 in (
    decode(s.att1,t.att1,0,1),
    decode(s.att2,t.att2,0,1)
  )
when not matched then insert
  (member_id, att1, att2)
  values (s.member_id, s.att1, s.att2);
Run the MERGE once and you will see "2 rows merged". Now run it a second time: you will get an ORA-12838 exception, which proves that you did a direct path insert of the one new row. Now commit and run it a third time: you will see "0 rows merged".

In the WHERE clause, I use DECODE to compare attributes because it gives "true" or "false" even when NULL values are compared. The "1 IN()" part allows Oracle to stop the comparisons as soon as any values are different.

David Aldridge suggested splitting changed records from new records so new records could benefit from direct path load. This is exactly what the MERGE statement does all by itself.
