Skip to Main Content
  • Questions
  • Dataloading with delete/insert and encountering duplicates

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, Peter.

Asked: January 23, 2005 - 8:45 pm UTC

Last updated: April 22, 2005 - 8:49 am UTC

Version: 9.2.0.5

Viewed 1000+ times

You Asked

Hi Tom,

I have design issue I hope you can give me advice on.

We have this table that contains 200 million rows. Each night we need to load around 20 million records. Before we can load, we need to delete any existing records. When we delete we're not using all the PK colums, because the new set of data my contain more or less data.

E.g. An airline will use varying # of fare classes on a daily basis depending on restriction, sales, etc. So one day an airline will use 10 fare classes for a FltNum, but the next day use 8 fare classes.

Table:

FltNum ClsCode Attr1 Attr2 ...
(PK) (PK)

Before we load in the new records, we will delete all records matching FltNum. Then we will bulk insert using java.sql.PreparedStatement. We execute a bunch of addBatch() on an INSERT SQL and then invoke executeBatch(). This works very well.

However there are times when the clients accidently sends up duplicate records in the new set of data to insert.

E.g.

FltNum ClsCode Attr1 Attr2 ...
------ ------- ----- -----
101 A 10 23
101 B 11 25
101 D 12 34 (duplicate)
101 D 12 34 (duplicate)
101 F 22 29

When we try to insert the above dataset, Oracle will throw a primary key violation. We trap for the BatchUpdateException, and then do slow-by-slow single insert, followed by single update on insert failure. So the above example would turn into:

FltNum ClsCode Attr1 Attr2 ...
------ ------- ----- -----
101 A 10 23 <- single insert
101 B 11 25 <- single insert
101 D 12 34 <- single insert
101 D 12 34 <- single insert, pk violiation, single update
101 F 22 29 <- single insert

Needless to say, when we fall into the slow-by-slow single row processing, performance gets hammered.

1) We can't use merge, since the new set of records may contain more less data.
2) We have to anticipate that the clients will send us duplicates.

How would you solve this problem?

Thanks,
-Peter

and Tom said...

I'd be using PL/SQL with a forall i insert and "save exceptions"

(eg: no way I'd use java to load data, it's too hard).

either that -- or direct path load into a stage table, enable the primary key with "exceptions into" (to capture the duplicate rowids), cleanse the stage table, and then insert /*+ append */ this data after doing a "delete from" (eg: 3 bulk operations, no procedural code)


or, use the conventional path loader which would again, bypass the need for procedural code, load the data and put the "bad" (duplicates) into a "bad" file.


or, use external tables and load using SQL, but a multi-table insert, something like:

ops$tkyte@ORA9IR2> delete from t where x in ( select x from et );

7 rows deleted.

ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> insert
2 when (rn=1) then
3 into t ( x, y, z ) values ( x, y, z )
4 when (rn<>1) then
5 into dups( x,y,z) values (x,y,z)
6 select x,y,z, row_number() over (partition by x,y order by z) rn from et
7 /

8 rows created.



That is, there would not be alot of code in my proposed solution.

Rating

  (10 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Brilliant as usual! Looks like an Oracle manual is missing:

Franco, January 25, 2005 - 6:09 am UTC

"Case Studies", where to find general solutions, involving different techniques, to some typical scenery like this proposed by Peter. Sometimes we all (no matter how much experienced) need good hints to follow reasonable and convenient paths to test, shortcutting quickly fruitless roads.
And this is not the first time that you give answers well structured like this. Just go on! :-)

So Close but need one more detail figured out !!!!

Anurag Mahesh, February 16, 2005 - 2:48 pm UTC

great solution!

we have a very similar situation and are planning to go the external table route - that is :

ops$tkyte@ORA9IR2> insert
2 when (rn=1) then
3 into t ( x, y, z ) values ( x, y, z )
4 when (rn<>1) then
5 into dups( x,y,z) values (x,y,z)
6 select x,y,z, row_number() over (partition by x,y order by z) rn from et
7 /

however the understanding we have with our client is that if the file has duplicates we will pick the 1st row from the file and reject all subsequent ones - Can that be easily accomplished ? i was wondering if

- in the above query instead of 'order by z' we could use 'order by rownum' ?

- what impact will this have if we are reading the external table in parallel?

will greatly appreciate your thoughts on this.

Thanks in advance

Anurag

Tom Kyte
February 16, 2005 - 3:00 pm UTC

can that be easily accomplished? Yes, you have the exact query to do it.... it is done.


order by "rownum" -- no, but

select x,y,z, row_number() over (partition by x,y order by R) rn
from (select rownum r, et.* from et)

would achieve that.

you'll find the concept of rownum and parallel query very much "orthoganal" here (mutually exclusive). You cannot have rownum assigned row by row, if you are processing many rows from all over the file at the same time!

Insert and Update on multiple table join

Vin, February 16, 2005 - 9:11 pm UTC

Is there any better way to perform the UPDATE on a single table based on a resultset from multiple tables then the method given below. 
The reason being, when join on tables with millions of records, it take more then 15 hours to update that one table for 46k records that matches with a resultset from join of multiple tables each with millions of records.
Would appreciate your advice.

SQL> Create table t1(a number, b number, w varchar2(10)); 
SQL> Create table t2(a number, b number, x varchar2(10)); 
SQL> Create table t3(a number, b number, y varchar2(10));
SQL> Create table t4(a number, b number, z varchar2(10));
SQL> Create table t5(a number, b number, w varchar2(10), x varchar2(10), y varchar2(10), z varchar2(10));

SQL> Insert into t1(a,b,w) values (1, 11, 't1');
SQL> Insert into t1(a,b,w) values (2, 22, 't1');
SQL> Insert into t1(a,b,w) values (3, 33, 't1');
SQL> Insert into t1(a,b,w) values (4, 44, 't1');
SQL> Insert into t1(a,b,w) values (5, 55, 't1');
SQL> Insert into t1(a,b,w) values (6, 66, 't1');
SQL> Insert into t1(a,b,w) values (7, 77, 't1');
SQL> Insert into t1(a,b,w) values (8, 88, 't1');

SQL> Insert into t2(a,b,x) values (1, 11, 't2');
SQL> Insert into t2(a,b,x) values (2, 22, 't2');
SQL> Insert into t2(a,b,x) values (3, 33, 't2');
SQL> Insert into t2(a,b,x) values (4, 44, 't2');
SQL> Insert into t2(a,b,x) values (5, 55, 't2');
SQL> Insert into t2(a,b,x) values (6, 66, 't2');

SQL> Insert into t3(a,b,y) values (1, 11, 't3');
SQL> Insert into t3(a,b,y) values (2, 22, 't3');
SQL> Insert into t3(a,b,y) values (3, 33, 't3');
SQL> Insert into t3(a,b,y) values (4, 44, 't3');

SQL> Insert into t4(a,b,z) values (1, 11, 't4');
SQL> Insert into t4(a,b,z) values (2, 22, 't4');

SQL> Select * from t1;

         A          B W
---------- ---------- ----------
         1         11 t1
         2         22 t1
         3         33 t1
         4         44 t1
         5         55 t1
         6         66 t1
         7         77 t1
         8         88 t1

8 rows selected.

SQL> Select * from t2;

         A          B X
---------- ---------- ----------
         1         11 t2
         2         22 t2
         3         33 t2
         4         44 t2
         5         55 t2
         6         66 t2

6 rows selected.

SQL> Select * from t3;

         A          B Y
---------- ---------- ----------
         1         11 t3
         2         22 t3
         3         33 t3
         4         44 t3

SQL> Select * from t4;

         A          B Z
---------- ---------- ----------
         1         11 t4
         2         22 t4

SQL> Select t1.a, t1.b, t1.w, t2.x, t3.y, t4.z
  2    From t1, t2, t3, t4
  3   Where t1.a = t2.a
  4     And t1.b = t2.b
  5     And t1.a = t3.a
  6     And t1.b = t3.b
  7     And t1.a = t4.a
  8     And t1.b = t4.b;

         A          B W  X  Y  Z
---------- ---------- -- -- -- --
         1         11 t1 t2 t3 t4
         2         22 t1 t2 t3 t4

SQL> Insert /*+ APPEND */ into t5 (a, b, w, x, y, z)
  2  Select t1.a, t1.b, t1.w, t2.x, t3.y, t4.z
  3    From t1, t2, t3, t4
  4   Where t1.a = t2.a
  5     And t1.b = t2.b
  6     And t1.a = t3.a
  7     And t1.b = t3.b
  8     And t1.a = t4.a
  9     And t1.b = t4.b;

2 rows created.

SQL> commit;
SQL> Select * from t5;

         A          B W  X  Y  Z
---------- ---------- -- -- -- --
         1         11 t1 t2 t3 t4
         2         22 t1 t2 t3 t4

SQL> Select * from t1 Where a=1 and b=11;

         A          B W
---------- ---------- ----------
         1         11 t1

SQL> Update t1 set w='v' Where a=1 and b=11;

1 row updated.

SQL> commit;
SQL> Select * from t1 Where a=1 and b=11;

         A          B W
---------- ---------- ----------
         1         11 v

SQL> Declare
  2  
  3   v_ucnt number     := 0;
  4   v_uget boolean := false;
  5  
  6  Begin
  7    For x in (Select t1.a, t1.b, t1.w, t2.x, t3.y, t4.z
  8                From t1, t2, t3, t4
  9               Where t1.a = t2.a
 10                 And t1.b = t2.b
 11                 And t1.a = t3.a
 12                 And t1.b = t3.b
 13                 And t1.a = t4.a
 14                 And t1.b = t4.b)
 15    Loop
 16      v_uget :=true;
 17      Update t5 y
 18         Set y.a = x.a,
 19             y.b = x.b,
 20             y.w = x.w,
 21             y.x = x.x,
 22             y.y = x.y,
 23             y.z = x.z
 24       Where y.a = x.a
 25         And y.b = x.b;
 26      v_ucnt := v_ucnt+1;
 27    End Loop;
 28    commit;
 29    If ( NOT v_uget ) then
 30       raise NO_DATA_FOUND;
 31    End If;
 32    dbms_output.put_line( 'No of Rows Updated in Table t5: ' || v_ucnt );
 33  End;
 34  /
No of Rows Updated in Table t5: 2

PL/SQL procedure successfully completed.

SQL> Select * from t5;

         A          B W  X  Y  Z
---------- ---------- -- -- -- --
         1         11 v  t2 t3 t4
         2         22 t1 t2 t3 t4 

Tom Kyte
February 17, 2005 - 7:43 am UTC

in 8i/9i -- i would:

create global temporary table gtt( a,b,w,x,y,z, primary key(a,b) );


once in my database, then to update:

insert into gtt select t1.a, .......;

update ( select gtt.w new_w, gtt.x new_x, gtt.y new_y, gtt.z new_z,
t5.w old_w, t5.x old_x, t5.y old_y, t5.z old_z
from gtt, t5
where gtt.a = t5.a and gtt.b = t5.b )
set old_w = new_w, old_x = new_x, ......;

sql%rowcount has the rows updated.

(note: updating a, b is pointless...)



In 10g I would:

update t5
using ( select ........... ) x
on ( t5.a = x.a and t5.b = x.b )
when matched then update set x = x.x, y = x.y, .....;


no gtt needed.


No procedural code in any case.

forgot to mention, Oracle version is 8174

Vin, February 16, 2005 - 9:35 pm UTC


MERGE?

Adrian, February 17, 2005 - 9:41 am UTC

<quote>
In 10g I would:

>>>>> update t5
using ( select ........... ) x
on ( t5.a = x.a and t5.b = x.b )
when matched then update set x = x.x, y = x.y, .....;
</unquote>

I think you mean "MERGE INTO". But it's a great technique and new feature of MERGE in 10g and could be the death of UPDATE in most non-trivial circumstances ;o)

Regards
Adrian

Tom Kyte
February 17, 2005 - 10:46 am UTC

merge, yes -- thanks.

you can MERGE with

a) just the insert part (not sure *why*, insert can already do that...)
b) just the update part.... (very useful)
c) both
d) an update with a conditional delete after the row is updated (eg: if you update a row AND the qty_on_hand goes to 0 -- delete the record...)

Am I doing anything incorrect here, please correct me.

Vin, February 17, 2005 - 12:46 pm UTC

Many Thanks for this solution.
Oracle Version 8174

SQL> Insert /*+ APPEND */ into t5 (a, b, w, x, y, z)
  2  Select t1.a, t1.b, t1.w, t2.x, t3.y, t4.z
  3    From t1, t2, t3, t4
  4   Where t1.a = t2.a
  5     And t1.b = t2.b
  6     And t1.a = t3.a
  7     And t1.b = t3.b
  8     And t1.a = t4.a
  9     And t1.b = t4.b;

2 rows created.
Commit complete.

SQL> Select * from t5;
         A          B W  X  Y  Z
---------- ---------- -- -- -- --
         1         11 t1 t2 t3 t4
         2         22 t1 t2 t3 t4

SQL> Select * from t1 Where a=1 and b=11;
         A          B W
---------- ---------- --
         1         11 t1

SQL> Update t1 set w='v' Where a=1 and b=11;

1 row updated.
Commit complete.

SQL> Select * from t1 Where a=1 and b=11;
         A          B W
---------- ---------- --
         1         11 v

SQL> Create global temporary table gtt(
a number,b number,w varchar2(2),
x varchar2(2),y varchar2(2),z varchar2(2),primary key(a,b));

Table created.

SQL> Insert into gtt (a, b, w, x, y, z)
  2  Select t1.a, t1.b, t1.w, t2.x, t3.y, t4.z
  3    From t1, t2, t3, t4
  4   Where t1.a = t2.a
  5     And t1.b = t2.b
  6     And t1.a = t3.a
  7     And t1.b = t3.b
  8     And t1.a = t4.a
  9     And t1.b = t4.b;

2 rows created.
Commit complete.

SQL> Update ( Select gtt.w new_w,
  2                  gtt.x new_x,
  3                  gtt.y new_y,
  4                  gtt.z new_z,
  5                  t5.w old_w,
  6                  t5.x old_x,
  7                  t5.y old_y,
  8                  t5.z old_z
  9             From gtt, t5
 10            Where gtt.a = t5.a
 11              And gtt.b = t5.b
 12         )
 13    Set old_w = new_w,
 14        old_x = new_x,
 15        old_y = new_y,
 16        old_z = new_z;

0 rows updated.  ===>Why 0 rows updated, when there is one 
Commit complete.     row updated in table t1.w.

SQL> Select * from t5;
         A          B W  X  Y  Z
---------- ---------- -- -- -- --
         1         11 t1 t2 t3 t4
         2         22 t1 t2 t3 t4
 

Tom Kyte
February 17, 2005 - 2:06 pm UTC

incomplete example, I cannot run it.

anyway, fortunately you left the "commit complete" in there even though you chopped out the (extremely relevant) code behind it.

what happens by default with global temporary tables upon commit?



ops$tkyte@ORA9IR2> create global temporary table gtt ( x int );
 
Table created.
 
ops$tkyte@ORA9IR2> insert into gtt values ( 1 );
 
1 row created.
 
ops$tkyte@ORA9IR2> select * from gtt;
 
         X
----------
         1
 
ops$tkyte@ORA9IR2> commit;
 
Commit complete.
 
ops$tkyte@ORA9IR2> select * from gtt;
 
no rows selected
 


 

I got it - Sorry to have wasted the white space and your time.

Vin, February 17, 2005 - 12:55 pm UTC

need to use this while creating the global temporary table 'on commit preserve rows'



Thanks

Anurag Mahesh, February 17, 2005 - 2:22 pm UTC

Tom - thanks for the excellend solution on my problem of bypassing duplicates from the flat file !! You are the best !!!

insert...select avoiding dups

Alex, April 21, 2005 - 10:26 am UTC

Tom,

I'm having a tough time trying to insert...select and avoiding unique constraint violations. I know all you have to do is add a condition to the select to grab records that don't already exist, but I can't get the syntax right. Given the following example, could you give a good way to do that? I was trying variations of

(t1.key1, t2.key2) not in (select key1, key2 from insert_table)

Not happening for me though.

insert into t (select t1.key1, t1.key2, t2.key3, col1
from table1 t1, table2 t2, table3 t3
where t1.key1 = t2.key1
and t1.key2 = t2.key2
and t1.key3 = t2.key3
and t2.key3 = t3.key3
and (where keys not in t....)

Much obliged.



Tom Kyte
April 22, 2005 - 8:49 am UTC

if you have a unique constraint on T(a,b,c)

it would be:

insert into t ( colums.... )
select columns....
from tables....
where .... joins ...
and (a,b,c) NOT IN (select a,b,c from t);


(you'll want to use the CBO, you'll want ONE of a,b,c in T to be "not null" at least -- defined in the dictionary as NOT NULL)

Perfect!

Nitin, January 13, 2010 - 10:10 pm UTC

After spending almost an hour trying to answer a similar question, I decided to search on AskTom. Lo and behold... here was the perfect answer to my problem.

Thank you!

More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.