Question and Answer

Tom Kyte

Thanks for the question, Leah.

Asked: December 28, 2000 - 8:50 pm UTC

Last updated: October 21, 2011 - 8:18 pm UTC

Version: 8.1.5

Viewed 10K+ times

You Asked

Hi Tom,

I have a table T1 with columns C1, C2 and C3. Currently, there are duplicate records in T1 with the C1, C2 combination.
T1:
C1 C2 C3
-- -- --
1 50 a
1 50 b
2 89 x
2 89 y
2 89 z

I would like to delete the duplicate record on this combination and retain either value of C3. What is the best way of writing the PL/SQL block or query to do this?

Thanks in advance.
Leah

and Tom said...

scott@ORA8I.WORLD> create table t1 ( c1 int, c2 int, c3 char(1) );

Table created.

scott@ORA8I.WORLD>
scott@ORA8I.WORLD> insert into t1 values ( 1, 50, 'a' );

1 row created.

scott@ORA8I.WORLD> insert into t1 values ( 1, 50, 'b' );

1 row created.

scott@ORA8I.WORLD> insert into t1 values ( 2, 89, 'x' );

1 row created.

scott@ORA8I.WORLD> insert into t1 values ( 2, 89, 'y' );

1 row created.

scott@ORA8I.WORLD> insert into t1 values ( 2, 89, 'z' );

1 row created.

scott@ORA8I.WORLD>
scott@ORA8I.WORLD> delete from T1
  2   where rowid <> ( select max(rowid)
  3                      from t1 b
  4                     where b.c1 = t1.c1
  5                       and b.c2 = t1.c2 )
  6  /

3 rows deleted.

scott@ORA8I.WORLD> select * from t1;

        C1         C2 C
---------- ---------- -
         1         50 b
         2         89 z


will do it.
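
As a quick sanity check, the same correlated subquery can be turned into a plain SELECT to preview exactly which rows the delete would remove - a minimal sketch against the same t1/c1/c2 example, using nothing beyond what the delete itself uses:

-- preview only: the rows the delete above would remove
select a.*
  from t1 a
 where a.rowid <> ( select max(b.rowid)
                      from t1 b
                     where b.c1 = a.c1
                       and b.c2 = a.c2 );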


Comments

Duplicate values

mohammad taha, February 18, 2002 - 10:12 pm UTC

The original question was as follows:

I have a table T1 with columns C1, C2 and C3. Currently, there
are duplicate records in T1 with the C1, C2 combination.
T1:
C1 C2 C3
-- -- --
1 50 a
1 50 b
2 89 x
2 89 y
2 89 z

I would like to delete the duplicate record on this combination and retain either value of C3. What is the best way of writing the PL/SQL block or query to do this?

Here, if I want to keep both values of C3 (meaning both 'a' and 'b'), what should be done?

More precisely, I want to see a result like:


scott@ORA8I.WORLD> select * from t1;

        C1         C2 C
---------- ---------- -
                      a
         1         50 b
         2         89 x
                      y
                      z





Tom Kyte
February 22, 2002 - 10:03 am UTC

Well, besides the fact that it sounds utterly "illogical", I would use SQL*Plus to do this:

ops$tkyte@ORA817DEV.US.ORACLE.COM> create table t1 ( c1 int, c2 int, c3 char(1) );

Table created.

ops$tkyte@ORA817DEV.US.ORACLE.COM> 
ops$tkyte@ORA817DEV.US.ORACLE.COM> insert into t1 values ( 1, 50, 'a' );

1 row created.

ops$tkyte@ORA817DEV.US.ORACLE.COM> insert into t1 values ( 1, 50, 'b' );

1 row created.

ops$tkyte@ORA817DEV.US.ORACLE.COM> insert into t1 values ( 2, 89, 'x' );

1 row created.

ops$tkyte@ORA817DEV.US.ORACLE.COM> insert into t1 values ( 2, 89, 'y' );

1 row created.

ops$tkyte@ORA817DEV.US.ORACLE.COM> insert into t1 values ( 2, 89, 'z' );

1 row created.

ops$tkyte@ORA817DEV.US.ORACLE.COM> 
ops$tkyte@ORA817DEV.US.ORACLE.COM>
ops$tkyte@ORA817DEV.US.ORACLE.COM> break on c1 on c2
ops$tkyte@ORA817DEV.US.ORACLE.COM> 
ops$tkyte@ORA817DEV.US.ORACLE.COM> select c1, c2, c3 from t1 order by c1, c2;

        C1         C2 C
---------- ---------- -
         1         50 a
                      b
         2         89 x
                      y
                      z


 

Sorry for asking this here - finding rows with duplicate values in one column

lotus, May 08, 2003 - 10:22 am UTC

In my employee table, I have these four columns (among others):

Col Name    Row 1        Row 2
---------   -----------  -----------
Seq_Id      1            2
F_name      'Tom'        'Tom'
L_Name      'Kyte'       'Kyte'
Full_name   'Kyte, Tom'  'Kyte, Tom'

I need to find the seq ids of all employees who have similar full names.
I would highly appreciate your feedback.
Thanks & best regards,



Tom Kyte
May 08, 2003 - 6:13 pm UTC

part of our warehouse building tool is a thing called pure integrate that does this name merging (which is a complex algorithm at best)....

You say "similar" but the above are identical. identical is easy (simple sql). similar takes a larger tool -- part of our data warehousing offering.

Sorry, I should have mentioned it - identical

A reader, May 09, 2003 - 11:54 am UTC

Yes, I am looking for records with identical names.
Thanks


Tom Kyte
May 09, 2003 - 2:57 pm UTC

select *
  from t
 where (columns-to-be-compared) in ( select columns-to-be-compared
                                       from t
                                      group by columns-to-be-compared
                                     having count(*) > 1 )
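
Applied to the employee example above, that pattern might look like the following sketch; the table name EMPLOYEES and the column names SEQ_ID and FULL_NAME are assumptions standing in for the real ones:

-- seq ids of every employee whose full_name occurs more than once
select seq_id, full_name
  from employees
 where full_name in ( select full_name
                        from employees
                       group by full_name
                      having count(*) > 1 )
 order by full_name, seq_id;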

Many Thanks!!

A reader, May 09, 2003 - 3:20 pm UTC


so simple...

Aries, May 12, 2003 - 7:15 am UTC

This is cool... but I have a solution you might want to assess. I run this with less LIO and it is much, much faster... what do you think, sir?

delete from t1
 where rowid in ( select rowid from t1 group by rowid, col1, col2, coln
                  minus
                  select min(rowid) from t1 group by col1, col2, coln );

...
assuming we have duplicates across col1, col2, ..., coln.

Tom Kyte
May 12, 2003 - 8:04 am UTC

sure, works great.

so simple...

Aries, May 12, 2003 - 11:49 pm UTC

Whew! cool... *salutes*

A simple 'nod' from the guru is an honor to me.
Sincerely, I am really learning a lot from this site...

May this site go a long, long way... and I am patiently waiting for the book, Tom.



Rory, May 13, 2003 - 5:18 am UTC

Great solutions.... c",)

Mark, May 30, 2003 - 9:55 am UTC

How about like this:


delete from t1
 where rowid in ( select MAX(rowid) from t1 group by col1, col2, coln having count(1) > 1 );

where your duplicates are across col1, col2, ..., coln.






Tom Kyte
May 30, 2003 - 10:18 am UTC

Nope, think about it.

if you have:

table t:

X
-------
1
1
1
1


how many times would you have to run your delete to de-dup that?
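
To make the point concrete: with four identical rows, MAX(rowid) ... HAVING COUNT(*) > 1 identifies only one row per group per execution, so that delete would have to run three times before the table was clean. A single-pass variant keeps one rowid per group instead - a sketch, assuming the duplicate key is col1, col2, ..., coln:

-- removes every duplicate in one pass by keeping only min(rowid) per group
delete from t1
 where rowid not in ( select min(rowid)
                        from t1
                       group by col1, col2, coln );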

A reader, May 30, 2003 - 11:35 am UTC


Deleting Duplicate Records

JAMIL, September 22, 2003 - 9:38 am UTC

Dear sir,
    
   I have this table for creating the items definitions
and the structure of the table as follow:

SQL> descr inventory
 Name                            Null?    Type
 ------------------------------- -------- ----
ITEM_CODE                                VARCHAR2(30)   PRIMARY KEY NOT NULL 
ITEM_NAME_E                              VARCHAR2(100)
STORE_CODE                               VARCHAR2(10)
ITEM_CLASS                               VARCHAR2(10)

The ITEM_CODE is generated automatically by the system, but I have duplicate records in ITEM_NAME_E, because the item name is entered manually by the user. So I would like to delete all the duplicate records and keep only the oldest ITEM_CODE. For example:

SELECT ITEM_CODE,SUBSTR(ITEM_NAME_E,1,20),ITEM_CLASS
    FROM JAM_INVENTORY A
    WHERE ITEM_NAME_E IN (SELECT ITEM_NAME_E FROM JAM_INVENTORY B
      WHERE ITEM_CLASS = 125
    GROUP BY ITEM_NAME_E
    HAVING COUNT(*) > 1)
    ORDER BY ITEM_NAME_E,ITEM_CODE
/

ITEM_CODE  SUBSTR(ITEM_NAME_E,1,20) ITEM_CLASS
---------- ------------------------ ----------
1251050075 BOX 120*60                      125  -> do not delete
1251050104 BOX 120*60                      125  -> delete
1251050135 BOX 120*60                      125  -> delete
1251050165 BOX 120*60                      125  -> delete
1251050196 BOX 120*60                      125  -> delete
1251050197 BOX 120*60                      125  -> delete
1251050198 BOX 120*60                      125  -> delete
1251050076 BOX 120*70                      125  -> do not delete
1251050105 BOX 120*70                      125  -> delete
1251050166 BOX 120*70                      125  -> delete
1251050199 BOX 120*70                      125  -> delete

I have 54,536 records in this table; would you please give me an example of how to delete the duplicate records quickly?
Thank you very much in advance 

Jamil  

 

Tom Kyte
September 22, 2003 - 10:53 am UTC

delete from t
 where item_code <> ( select min(item_code)
                        from t t2
                       where t2.item_name_e = t.item_name_e );

an index on (item_name_e,item_code) would be useful for making this "faster"
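
Since the requirement was to keep the oldest (lowest, system-generated) ITEM_CODE, it can be worth previewing what would go before deleting against 54,536 rows; a sketch reusing the poster's JAM_INVENTORY table name:

-- preview only: every item_code except the oldest one per item_name_e
select t.item_code, substr(t.item_name_e,1,20) item_name
  from jam_inventory t
 where t.item_code <> ( select min(t2.item_code)
                          from jam_inventory t2
                         where t2.item_name_e = t.item_name_e )
 order by t.item_name_e, t.item_code;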

How to get this

pawan, September 29, 2003 - 1:54 pm UTC

Thanks for all the help.
Have a table with the following records

col1 col2 Desc
---- ---- -------------------------
C08 04 REFUSED - INCORRECT ORDER
C09 04 REFUSED - SHIPPING ERROR
C10 04 REFUSED - DUPLICATE ORDER
C11 04 REFUSED - NO REASON
C12 04 RECEIVED LATE
C13 04 ORDER CANCELLED
C14 04 REFUSED - DAMAGED IN TRANSIT
C15 06 WRONG MERCHANDISE
C16 07 DISCONTINUED BY SUPPLIER
C17 07 DISCONTINUED BY CUSTOMER
C18 08 RECALLS

When I run the following query (the filter t1.col2 in ('04','06','07','08') is just to simplify):

select * from cso.reason_codes_xref t1
 where col1 in ( select min(col1)
                   from cso.reason_codes_xref t2
                  where t2.col2 = t1.col2 )
   and t1.col2 in ('04','06','07','08')

I get:
col1 col2 Desc
---- ---- -------------------------
C08 04 REFUSED - INCORRECT ORDER
C15 06 WRONG MERCHANDISE
C16 07 DISCONTINUED BY BMS
C18 08 RECALLS

I want to have distinct col1,col2 and concatenation of all the "desc" for a particular col2.

i.e.
C08 04 REFUSED - INCORRECT ORDER||REFUSED - SHIPPING ERROR||REFUSED - DUPLICATE ORDER...
C15 06 WRONG MERCHANDISE
C16 07 DISCONTINUED BY SUPPLIER||DISCONTINUED BY CUSTOMER
C18 08 RECALLS

Tom Kyte
September 29, 2003 - 2:50 pm UTC

search this site for


STRAGG
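
STRAGG is a user-defined aggregate described elsewhere on this site; on 11.2 and later the built-in LISTAGG does the same job. A sketch against the poster's reason_codes_xref, where DESCR is an assumed name standing in for the description column ("desc" itself is a reserved word), and '||' is used literally as the separator the poster asked for:

-- one row per col2: the smallest col1 plus all descriptions concatenated
select min(col1) col1,
       col2,
       listagg(descr, '||') within group (order by col1) all_desc
  from cso.reason_codes_xref
 where col2 in ('04','06','07','08')
 group by col2;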

SELECTING duplicate records

Jamil, December 02, 2003 - 6:33 am UTC

Dear Sir
I would like to get the duplicate items by using this query, but it is not giving the right result.


SELECT ITEM_CODE, SUBSTR(ITEM_NAME_E,1,40)
  FROM PROJ.IM_INVENTORY
 WHERE (ITEM_LENGTH, ITEM_WIDTH, ITEM_HIGHT1, ITEM_HIGHT2)
       IN (SELECT ITEM_LENGTH, ITEM_WIDTH, ITEM_HIGHT1, ITEM_HIGHT2
             FROM PROJ.IM_INVENTORY
            WHERE ITEM_CLASS = 125
            GROUP BY ITEM_NAME_E, ITEM_CLASS, ITEM_LENGTH, ITEM_WIDTH, ITEM_HIGHT1, ITEM_HIGHT2
           HAVING COUNT(*) > 1)

How can I get the duplicate records by checking multiple fields? Can you give an example?
Thank you for your valuable answer.

Best regards
Jamil

Tom Kyte
December 02, 2003 - 8:43 am UTC

your query works to find dups as far as I can tell.

ops$tkyte@ORA920PC> select * from im_inventory;
 
ITEM_CODE ITEM_NAME_E ITEM_LENGTH ITEM_WIDTH ITEM_HIGHT1 ITEM_HIGHT2 ITEM_CLASS
--------- ----------- ----------- ---------- ----------- ----------- ----------
        1           1           1          1           1           1        125
        1           1           1          1           1           1        125
        1           1           1          2           1           1        125
        1           1           1          3           1           1        125
        1           1           1          3           1           1        126
        1           1           1          3           1           1        126
        1           1           1          3           1           1        126
 
7 rows selected.
 
ops$tkyte@ORA920PC> SELECT ITEM_CODE,SUBSTR(ITEM_NAME_E,1,40)
  2      FROM IM_INVENTORY
  3      WHERE (ITEM_LENGTH,ITEM_WIDTH,ITEM_HIGHT1,ITEM_HIGHT2)
  4  IN (SELECT  ITEM_LENGTH,ITEM_WIDTH,ITEM_HIGHT1,ITEM_HIGHT2
  5              FROM IM_INVENTORY
  6             WHERE ITEM_CLASS =125
  7              GROUP BY
  8  ITEM_NAME_E,ITEM_CLASS,ITEM_LENGTH,ITEM_WIDTH,ITEM_HIGHT1,ITEM_HIGHT2
  9              HAVING COUNT(*) > 1)
 10  /
 
 ITEM_CODE SUBSTR(ITEM_NAME_E,1,40)
---------- ----------------------------------------
         1 1
         1 1
 


you must just not "have any to find" 

selecting the duplicate records

Jamil Shaibani, December 03, 2003 - 5:31 am UTC

Dear Sir
Thank you very much for your help. As you said, the query works fine,
but I have made some modifications to it as follows:

SELECT A.ITEM_CODE, SUBSTR(A.ITEM_NAME_E,1,40)
  FROM IM_INVENTORY A
 WHERE (A.ITEM_LENGTH, A.ITEM_WIDTH, A.ITEM_HIGHT1, A.ITEM_HIGHT2)
       IN (SELECT B.ITEM_LENGTH, B.ITEM_WIDTH, B.ITEM_HIGHT1, B.ITEM_HIGHT2
             FROM IM_INVENTORY B
            WHERE A.ITEM_CLASS = B.ITEM_CLASS
            GROUP BY A.ITEM_NAME_E, A.ITEM_CLASS, B.ITEM_LENGTH, B.ITEM_WIDTH, B.ITEM_HIGHT1, B.ITEM_HIGHT2
           HAVING COUNT(*) > 1)
 ORDER BY A.ITEM_LENGTH, A.ITEM_WIDTH, A.ITEM_HIGHT1, A.ITEM_HIGHT2

This query works fine if the table contains only a few records, but my current table contains 56,077 records, so it takes a long time; I have been waiting for the result for four hours and it has still not finished. I have also created an index on the table as follows:

CREATE INDEX INV_IND1 ON IM_INVENTORY (ITEM_CLASS,ITEM_LENGTH,ITEM_WIDTH,ITEM_HIGHT1,ITEM_HIGHT2)

How can I make this query faster?

Best regards
Jamil



Tom Kyte
December 03, 2003 - 7:12 am UTC

Are you using the RBO or the CBO? What query plan do you have? (hint: use the CBO if you want to go fast)
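
For reference, the plan can be captured in SQL*Plus either with autotrace or with EXPLAIN PLAN plus DBMS_XPLAN (both available in 9iR2); a minimal sketch, using a trivial statement just to show the mechanics:

-- option 1: show plans (without executing queries) for statements run in this session
set autotrace traceonly explain

-- option 2: explain one statement and display its plan
explain plan for
  select count(*) from im_inventory;

select * from table(dbms_xplan.display);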

SELECTING THE DUPLICATE RECORDS

Jamil, December 04, 2003 - 12:15 am UTC

Dear Sir
I do not have any idea about the RBO and CBO. Can you explain in more detail what the RBO and CBO are? I am running my query on
Oracle9i Enterprise Edition Release 9.2.0.1.0 - Production
With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options
JServer Release 9.2.0.1.0 - Production

Best regards
Jamil


Tom Kyte
December 04, 2003 - 7:46 am UTC

Ahh, ok, your reading assignment for the weekend:

a) read the concepts guide, the most important document about oracle you'll ever read.

b) read the performance and tuning guide, it'll do the deep dive on the optimizer

https://docs.oracle.com

do NOT skip step a)

it is the most important thing you can do for your Oracle career.

mani

Mani, March 17, 2004 - 12:20 pm UTC

Hi Tom,
I read the posts about duplicates, but I see examples with
row-column combination duplicates.

In my case, I need to identify duplicates which can belong to different columns but the same row.

I have a table T with columns C1, C2, C3 and C4.
T1:
row_id C1 C2 C3 C4
-- -- -- -- --
1 1 50 121 789
2 2 39 50 650
3 1 21 39 50
4 77 89 451 455
5 87 89 898 999

I want to identify duplicate row values with >= 2 row data match.
e.g.
1) for row_id = 2
C2 = 39 and C3 = 50
and in row_id = 3
C3 = 39 and C4 = 50

2) for row_id = 1
C1 = 1 and C2 = 50
and in row_id = 3
C1 = 1 and C4 = 50

In this case 2 rows had 2 values duplicate but they belonged to different columns.

What is the best way of writing the PL/SQL block or query to do this?

I know this is a very weird requirement. Please help, I have no other choice.

Thanks in advance.
Mani

Tom Kyte
March 17, 2004 - 3:20 pm UTC

ugh, a cartesian join. You do understand that every row must be compared to every row. If you have 10,000 rows that is 100,000,000 compares.

still want to do it? (we can)

yes please, I am aware of cartesian

Mani, March 17, 2004 - 4:24 pm UTC

Hi Tom,
In fact my actual table has 6 cols, the other has 4.
If you can please show an example with 6 cols, that would be great! Even if the solution does not scale down to 4 cols, that's OK; I can rewrite it for 4 cols from the 6-col example.
I do not have huge data, so the cartesian join will not affect me that badly, but I still cannot do this manually - it is very error prone and extremely time consuming.

Again, I am sorry to ask for solutions to weird requirements like this.

Thanks always

Tom Kyte
March 17, 2004 - 4:34 pm UTC

ops$tkyte@ORA9IR2> select t1.row_id, t2.row_id
  2    from t t1, t t2
  3 where case when t1.c1 in ( t2.c1, t2.c2, t2.c3, t2.c4 ) then 1 else 0 end +
  4       case when t1.c2 in ( t2.c1, t2.c2, t2.c3, t2.c4 ) then 1 else 0 end +
  5       case when t1.c3 in ( t2.c1, t2.c2, t2.c3, t2.c4 ) then 1 else 0 end +
  6   case when t1.c4 in ( t2.c1, t2.c2, t2.c3, t2.c4 ) then 1 else 0 end > = 2
  7     and t1.row_id < t2.row_id
  8  /
 
    ROW_ID     ROW_ID
---------- ----------
         1          3
         2          3



to make it for 6 should be easy... just add to the t2.c... list and add two more "cases"

I tried this with your set of data, 1000, 2000 and 3000 rows and found

1000 = 0.87 seconds
2000 = 3.66 seconds
3000 = 8.42 seconds

as you can see - about 1 second per thousand rows :)

Not a total cartesian join with the "t1.row_id < t2.row_id" in there but... 
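
For the six-column version Tom describes, the pattern simply grows: each of t1's columns gets its own CASE, and every IN list covers all of t2's columns. A sketch, assuming the two extra columns are named c5 and c6:

select t1.row_id, t2.row_id
  from t t1, t t2
 where case when t1.c1 in ( t2.c1, t2.c2, t2.c3, t2.c4, t2.c5, t2.c6 ) then 1 else 0 end +
       case when t1.c2 in ( t2.c1, t2.c2, t2.c3, t2.c4, t2.c5, t2.c6 ) then 1 else 0 end +
       case when t1.c3 in ( t2.c1, t2.c2, t2.c3, t2.c4, t2.c5, t2.c6 ) then 1 else 0 end +
       case when t1.c4 in ( t2.c1, t2.c2, t2.c3, t2.c4, t2.c5, t2.c6 ) then 1 else 0 end +
       case when t1.c5 in ( t2.c1, t2.c2, t2.c3, t2.c4, t2.c5, t2.c6 ) then 1 else 0 end +
       case when t1.c6 in ( t2.c1, t2.c2, t2.c3, t2.c4, t2.c5, t2.c6 ) then 1 else 0 end >= 2
   and t1.row_id < t2.row_id
/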

You made my life so easy

malini, June 03, 2004 - 2:38 pm UTC

I had a table with no primary key and by mistake inserted about 40,000 rows twice. I was stuck. This solution helped me get going in no time.
Thanks again,


Deleting duplicates

A reader, October 13, 2004 - 11:15 pm UTC

create table t (pk int,i int,j int,k int);
insert into t values(1,2,3,4);
insert into t values(2,4,5,6);
insert into t values(3,6,7,8);


create table tt (pk int, the_type varchar2(1), the_value int);
insert into tt values (1,'I',2);
insert into tt values (1,'J',3);
insert into tt values (1,'K',4);

insert into tt values (2,'I',4);
insert into tt values (2,'K',6);

insert into tt values (3,'I',6);
insert into tt values (3,'J',7);
insert into tt values (3,'K',9);

i.e. one row in t generates up to 3 rows in tt.

Question:
1. How can I delete rows in t if all 3 matching rows are present in tt?

In the above example, the pk=1 row should get deleted since all its 3 "expanded" rows are present in tt with matching values.

pk=2 should NOT get deleted because it has only 2 matching rows.

pk=3 should NOT get deleted because even though it has 3 matching rows in tt, the values don't match (t.k=8 while tt.the_value for 'K' is 9).

[In reality, tt is 1000s of times larger than t in terms of number of rows]

Help? Thanks

Tom Kyte
October 14, 2004 - 9:25 am UTC

ops$tkyte@ORA9IR2> select *
  2    from (
  3  select tt.pk, tt.rowid rid, count(*) over (partition by tt.pk) cnt
  4    from tt, t
  5   where tt.pk = t.pk
  6     and tt.the_value in ( t.i, t.j, t.k )
  7         )
  8   where cnt = 3
  9  /
 
        PK RID                       CNT
---------- ------------------ ----------
         1 AAAKKxAAJAAAAP+AAA          3
         1 AAAKKxAAJAAAAP+AAB          3
         1 AAAKKxAAJAAAAP+AAC          3
 
ops$tkyte@ORA9IR2> select * from tt;
 
        PK T  THE_VALUE
---------- - ----------
         1 I          2
         1 J          3
         1 K          4
         2 I          4
         2 K          6
         3 I          6
         3 J          7
         3 K          9
 
8 rows selected.
 
ops$tkyte@ORA9IR2> delete from tt
  2  where rowid in (
  3  select rid
  4    from (
  5  select tt.rowid rid, count(*) over (partition by tt.pk) cnt
  6    from tt, t
  7   where tt.pk = t.pk
  8     and tt.the_value in ( t.i, t.j, t.k )
  9         )
 10   where cnt = 3
 11  )
 12  /
 
3 rows deleted.
 
ops$tkyte@ORA9IR2> select * from tt;
 
        PK T  THE_VALUE
---------- - ----------
         2 I          4
         2 K          6
         3 I          6
         3 J          7
         3 K          9


You don't say what should happen if there were extra "pk=1" rows in there that either duplicated the "the_value" column or had "the_value" columns not in T, so I sort of ignore them.
 

the other table needs to have the records deleted ...

Gabe, October 14, 2004 - 10:21 am UTC

<quote>How can I delete rows in t if ...</quote>
^^^
Rowid needs to come from T + "select distinct rid" (although will work without it).

The other thing you maybe should've warned of are the NULLs ... the following pk=4 will get deleted although they certainly don't _look_ like a "match"

insert into t values (4,1,null,1);

insert into tt values (4,'I',1);
insert into tt values (4,'J',1);
insert into tt values (4,'K',1);

... and the following pk=5 will not get deleted although they certainly _look_ like a "match"

insert into t values (5,1,null,1);

insert into tt values (5,'I',1);
insert into tt values (5,'J',null);
insert into tt values (5,'K',1);




Tom Kyte
October 14, 2004 - 10:52 am UTC


doh, right on both -- t.rowid, not tt.rowid and

c/count(*)/count(distinct the_value)/

thanks for that eyeball there, you keep me honest :)


(ps:

Gabe -- do you want to enter the exciting, fun filled world of technical editing? if so, I think I can swing a slot -- you have a really good eye and aren't afraid to point out things.... pay is bad, hours are long -- but you get a box of books at the end. email me if you are interested....)




A reader, October 14, 2004 - 4:54 pm UTC

ops$tkyte@ORA9IR2> delete from tt
2 where rowid in (
3 select rid
4 from (
5 select tt.rowid rid, count(*) over (partition by tt.pk) cnt
6 from tt, t
7 where tt.pk = t.pk
8 and tt.the_value in ( t.i, t.j, t.k )
9 )
10 where cnt = 3
11 )
12 /

Where do you take my "the_type" column into account here? The insert into t values(1,2,3,4); row in t has a match in tt only if tt.the_value for the_type='I' is 2, value for tt.the_value for the_type='J' is 3 and tt.the_value for the_type='K' is 4.

Your IN on line 8 above doesn't make this distinction?

Or maybe I am missing something?

Tom Kyte
October 14, 2004 - 7:51 pm UTC

didn't catch the 'subtlety' of the ijk -- looks like you figured it out as below as a variation of this technique.

A reader, October 14, 2004 - 5:17 pm UTC

Something like

delete from t
 where rowid in (
   select rid
     from (
       select t.*, t.rowid rid, tt.the_value, count(*) over (partition by tt.pk) cnt
         from tt, t
        where tt.pk = t.pk
          and case the_type
                when 'I' then t.i
                when 'J' then t.j
                when 'K' then t.k
              end = tt.the_value
          )
    where cnt = 3
 )

seems to do what I want

Is this going to scale well? How would you do it?

Thanks

Tom Kyte
October 14, 2004 - 7:52 pm UTC

it is as good as anything else -- nice big full scans, hoping for some nice hash joins.

A reader, October 14, 2004 - 9:08 pm UTC

and case the_type
when 'I' then t.i
when 'J' then t.j
when 'K' then t.k
end = tt.the_value

In reality, I have 3 columns for each value of I/J/K i.e. i_1,i_2 and i_3, j_1,...k_3

How can I do the above with CASE i.e. CASE can return only 1 expression.

So I would have to do

and case the_type
when 'I' then t.i_1
when 'J' then t.j_1
when 'K' then t.k_1
end = tt.value_1
and case the_type
when 'I' then t.i_2
when 'J' then t.j_2
when 'K' then t.k_2
end = tt.value_2
and case the_type
when 'I' then t.i_3
when 'J' then t.j_3
when 'K' then t.k_3
end = tt.value_3

Is there a way to write the above more succinctly?

Tom Kyte
October 15, 2004 - 10:58 am UTC

Ok, now that we've gone back and forth and I completely get the entire picture, try this:


ops$tkyte@ORA9IR2> select pk,
  2         max(decode(the_type,'I',the_value)) i,
  3         max(decode(the_type,'J',the_value)) j,
  4         max(decode(the_type,'K',the_value)) k
  5    from tt
  6   where the_type in ( 'I', 'J', 'K' )
  7   group by pk
  8  /
 
        PK          I          J          K
---------- ---------- ---------- ----------
         1          2          3          4
         2          4                     6
         3          6          7          9
 
Pivot the data so TT looks like T (this assumes PK,THE_TYPE is UNIQUE in TT!!! to work properly, hope that is true...)

ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> select t.rowid rid
  2    from t
  3   where (pk,i,j,k) in (
  4  select pk,
  5         max(decode(the_type,'I',the_value)) i,
  6         max(decode(the_type,'J',the_value)) j,
  7         max(decode(the_type,'K',the_value)) k
  8    from tt
  9   where the_type in ( 'I', 'J', 'K' )
 10   group by pk );
 
RID
------------------
AAAHuhAAHAAAACOAAA
 
that is the row we want to delete

ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> delete from t where rowid in
  2  (
  3  select t.rowid rid
  4    from t
  5   where (pk,i,j,k) in (
  6  select pk,
  7         max(decode(the_type,'I',the_value)) i,
  8         max(decode(the_type,'J',the_value)) j,
  9         max(decode(the_type,'K',the_value)) k
 10    from tt
 11   where the_type in ( 'I', 'J', 'K' )
 12   group by pk )
 13  )
 14  /
 
1 row deleted.
 
that deleted it... Now, if there are MANY pk values in TT that won't be in T, you might also add:


ops$tkyte@ORA9IR2> rollback;
 
Rollback complete.
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> delete from t where rowid in
  2  (
  3  select t.rowid rid
  4    from t
  5   where (pk,i,j,k) in (
  6  select pk,
  7         max(decode(the_type,'I',the_value)) i,
  8         max(decode(the_type,'J',the_value)) j,
  9         max(decode(the_type,'K',the_value)) k
 10    from tt
 11   where the_type in ( 'I', 'J', 'K' )
 12     and pk in ( select pk from t )
 13   group by pk )
 14  )
 15  /
 
1 row deleted.
 
 

How to delete duplicate records from a non database block in forms 6i

Navdeep Singh Saini, November 16, 2004 - 2:54 am UTC

Hi

I am using Forms 6i; how can I delete duplicate records from a non-database block?

Thanks
Navdeep

Tom Kyte
November 16, 2004 - 6:39 am UTC

procedurally, brute force.

a double loop probably is easiest if you have a small enough number of records. i don't remember the forms APIs but you have the ability to address the i'th record in there somehow. just loop over each record and in an inner loop, loop over the records that come after this one -- looking for a dup. if you find it, delete record.

Deleting duplicate records from a non database block in Forms

A reader, November 18, 2004 - 1:24 am UTC

I have tried this method; is there any other method available?
I took three fields from the first record and stored them in variables, then compared these three variables with the records coming after the first record; if it finds a duplicate record, it deletes it. Then it moves to the second record, stores its values in variables, and does the same thing. But what can I do when there are a lot of columns (fields) in the block? Is there any way to declare an array in Forms? Remember, it is a non database block.

Thanks

Tom Kyte
November 18, 2004 - 10:27 am UTC

procedural code is procedural code, you are just writing an algorithm to work on a record group? very straight forward, but brute force.

there is no magic here, it is called "code"

Thanks .......I just did it

A reader, November 19, 2004 - 1:54 am UTC

Thanks very much, I did it by using a PL/SQL record.

Thanks

Why delete statement is taking very long time ?

Parag Jayant Patankar, November 30, 2004 - 1:46 am UTC

Hi Tom,

SQL> select count(*) from vstfz3s;

  COUNT(*)
----------
      6679

I am trying to delete any duplicate records from this table with the following SQL command:

prompt Delete
delete 
from vstfz3s a
where rowid <> ( select max(rowid)
                 from   vstfz3s b
                 where  b.g4000 = a.g4000 and b.g4300 = a.g4300 );
commit;

But this delete is taking more than 1 hr and 15 mins. Can you please tell me

1. Why is it taking such a long time?
2. How can I improve this delete statement in SQL?
   (without creating additional tables or PL/SQL, etc.)

regards & thanks
pjp
 

Tom Kyte
November 30, 2004 - 7:36 am UTC

1) no, i cannot tell you why. I can guess why.

perhaps there is no index on g4000,g4300. perhaps it is full scanning vstfz3s 6,679 times (processing some 45 million rows)

perhaps there is an index on g4000,g4300 and you are waiting for 6,679 index range scans to complete on a busy machine.

perhaps someone has a lock on a row and you were just sitting there waiting for em.

perhaps you have 500 indexes on this table.

those would be among the reasons....... with the information you've supplied, no one could say for sure.

2) i like to

delete from t where rowid IN
( select rid
    from (select rowid rid,
                 row_number() over (partition by g4000, g4300 order by rowid) rn
            from t )
   where rn <> 1 );

that'll full scan t once to generate the set of rowids to delete and then delete them.
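
Once the duplicates are gone, a unique constraint on the de-dup key keeps them from coming back - a sketch, assuming (g4000, g4300) really is meant to be unique in vstfz3s (the constraint name is made up):

-- fails if any duplicates remain; afterwards, duplicate inserts are rejected
alter table vstfz3s
  add constraint vstfz3s_uk unique (g4000, g4300);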




Another dedup solution

Dan, March 18, 2005 - 4:32 pm UTC

Great, great thread - I've learned a lot about removing duplicate records.

Out of all the solutions (mainly in the beginning of this thread), I found this one to be the fastest yet. What do you think?

create table some_table (
     col1 varchar2(100)
    ,col2 varchar2(100)
    ,col3 varchar2(100)
);

Table created.

SQL> insert into some_table select owner, object_name, object_type from dba_objects;

275314 rows created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats('apps', 'some_table');

PL/SQL procedure successfully completed.

SQL> explain plan for
  2  delete from some_table
  3  where rowid in (
  4    select t1.rowid
  5    from    some_table t1, some_table t2
  6    where   t1.col1 = t2.col1
  7    and     t1.col2 = t2.col2
  8    and     t1.col3 = t2.col3
  9    and     t1.rowid < t2.rowid
 10  );

Explained.

SQL> @utlxpls

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------
-----------

----------------------------------------------------------------------------
| Id  | Operation                    |  Name       | Rows  | Bytes | Cost  |
----------------------------------------------------------------------------
|   0 | DELETE STATEMENT             |             |     4 |    56 |   888 |
|   1 |  DELETE                      | SOME_TABLE  |       |       |       |
|   2 |   NESTED LOOPS               |             |     4 |    56 |   888 |
|   3 |    VIEW                      | VW_NSO_1    |     4 |    28 |   860 |
|   4 |     SORT UNIQUE              |             |     4 |   280 |       |
|   5 |      HASH JOIN               |             |     4 |   280 |   860 |
|   6 |       TABLE ACCESS FULL      | SOME_TABLE  |   275K|  9410K|   233 |
|   7 |       TABLE ACCESS FULL      | SOME_TABLE  |   275K|  9410K|   233 |
|   8 |    TABLE ACCESS BY USER ROWID| SOME_TABLE  |     1 |     7 |     1 |
----------------------------------------------------------------------------

Note: cpu costing is off, PLAN_TABLE' is old version

16 rows selected.


SQL> delete from some_table
  2  where rowid in (
  3    select t1.rowid
  4    from    some_table t1, some_table t2
  5    where   t1.col1 = t2.col1
  6    and     t1.col2 = t2.col2
  7    and     t1.col3 = t2.col3
  8    and     t1.rowid < t2.rowid
  9  );

785 rows deleted. 

Tom Kyte
March 18, 2005 - 8:11 pm UTC

looks fine to me.

(you are sure that is faster than using row_number()?)

Dan, March 18, 2005 - 11:48 pm UTC

<quote>
(you are sure that is faster than using row_number()?)
</quote>

Yes, at least for one of our "real" tables. I'm sure there are scenarios where row_number() would come out ahead.

apps@ECTSDV> update xxcts_ent_okc_k_items_stg s
2 set process_flag = 'D'
3 where rowid in (
4 select rid
5 from
6 ( select rowid rid,
7 row_number() over (partition by cle_id, new_object1_id1, jtot_object1_code, dnz_chr_id order by rowid) rn
8 from xxcts_ent_okc_k_items_stg )
9 where rn <> 1
10 );

17 rows updated.

Elapsed: 00:00:39.05

Execution Plan
----------------------------------------------------------
0 UPDATE STATEMENT Optimizer=CHOOSE (Cost=15870 Card=1315862 Bytes=21053792)
1 0 UPDATE OF 'XXCTS_ENT_OKC_K_ITEMS_STG'
2 1 HASH JOIN (SEMI) (Cost=15870 Card=1315862 Bytes=21053792)
3 2 INDEX (FULL SCAN) OF 'XXCTS_ENT_OKC_K_ITEMS_STG_N01' (NON-UNIQUE) (Cost=10320 Card=1315862 Byt
es=11842758)

4 2 VIEW OF 'VW_NSO_1' (Cost=14037 Card=1315862 Bytes=9211034)
5 4 VIEW (Cost=14037 Card=1315862 Bytes=26317240)
6 5 WINDOW (SORT) (Cost=14037 Card=1315862 Bytes=57897928)
7 6 TABLE ACCESS (FULL) OF 'XXCTS_ENT_OKC_K_ITEMS_STG' (Cost=3796 Card=1315862 Bytes=5789792
8)





Statistics
----------------------------------------------------------
0 recursive calls
146 db block gets
51350 consistent gets
49238 physical reads
15920 redo size
456 bytes sent via SQL*Net to client
713 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
2 sorts (memory)
1 sorts (disk)
17 rows processed

apps@ECTSDV> rollback;

Rollback complete.

apps@ECTSDV> update xxcts_ent_okc_k_items_stg s
2 set process_flag = 'D'
3 where rowid in (
4 select t1.rowid
5 from xxcts_ent_okc_k_items_stg t1, xxcts_ent_okc_k_items_stg t2
6 where t1.cle_id = t2.cle_id
7 and t1.new_object1_id1 = t2.new_object1_id1
8 and t1.jtot_object1_code = t2.jtot_object1_code
9 and t1.dnz_chr_id = t2.dnz_chr_id
10 and t1.rowid < t2.rowid
11 );

17 rows updated.

Elapsed: 00:00:24.09

Execution Plan
----------------------------------------------------------
0 UPDATE STATEMENT Optimizer=CHOOSE (Cost=9857 Card=1 Bytes=16)
1 0 UPDATE OF 'XXCTS_ENT_OKC_K_ITEMS_STG'
2 1 NESTED LOOPS (Cost=9857 Card=1 Bytes=16)
3 2 VIEW OF 'VW_NSO_1' (Cost=9832 Card=1 Bytes=7)
4 3 SORT (UNIQUE)
5 4 HASH JOIN (Cost=9832 Card=1 Bytes=88)
6 5 TABLE ACCESS (FULL) OF 'XXCTS_ENT_OKC_K_ITEMS_STG' (Cost=3796 Card=1315862 Bytes=5789792
8)

7 5 TABLE ACCESS (FULL) OF 'XXCTS_ENT_OKC_K_ITEMS_STG' (Cost=3796 Card=1315862 Bytes=5789792
8)

8 2 TABLE ACCESS (BY USER ROWID) OF 'XXCTS_ENT_OKC_K_ITEMS_STG' (Cost=1 Card=1 Bytes=9)




Statistics
----------------------------------------------------------
0 recursive calls
133 db block gets
50065 consistent gets
64003 physical reads
15544 redo size
458 bytes sent via SQL*Net to client
809 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
2 sorts (memory)
0 sorts (disk)
17 rows processed


Delete duplicate

Huy, June 20, 2005 - 11:20 am UTC

I have a table, my_table, with columns (contact_id, email, blah) and I want to remove duplicate email addresses according to the following rule:

_If a contact has email addresses that are all the same, keep one, delete all others.

_If a contact has some email addresses that are the same and some that are different, keep them all.

The following SQL seems to work

delete from my_table t
where rowid <> (select max(rowid)
from my_table u
where u.contact_id = t.contact_id)
and t.contact_id in (--multiple emails that are the same
select v.contact_id
from my_table v
group by v.contact_id
having count(distinct v.email) = 1 and count(v.email) > 1);

But I wonder if there is something more elegant?

Thanks


Tom Kyte
June 20, 2005 - 12:48 pm UTC

I'd use analytics for this to retrieve the set of rows to remove, consider:

ops$tkyte@ORA9IR2> create table t ( cid number, email varchar2(5), etc number );
 
Table created.
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> insert into t select 1, 'xxxx', rownum from all_users where rownum <= 5;
 
5 rows created.
 
ops$tkyte@ORA9IR2> insert into t select 2, 'x'||rownum, rownum from all_users where rownum <= 5;
 
5 rows created.
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> select cid, email,
  2         row_number() over (partition by cid order by email) rn,
  3             count(distinct email) over (partition by cid) cnt
  4    from t;
 
       CID EMAIL         RN        CNT
---------- ----- ---------- ----------
         1 xxxx           1          1
         1 xxxx           2          1
         1 xxxx           3          1
         1 xxxx           4          1
         1 xxxx           5          1
         2 x1             1          5
         2 x2             2          5
         2 x3             3          5
         2 x4             4          5
         2 x5             5          5
 
10 rows selected.
 
your rule is -- when cnt = 1 (one email) remove all rows such that rn > 1, eg these rows:

ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> select *
  2    from (
  3  select rowid rid,
  4         cid, email,
  5         row_number() over (partition by cid order by email) rn,
  6             count(distinct email) over (partition by cid) cnt
  7    from t
  8         )
  9   where cnt = 1 and rn > 1
 10  /
 
RID                       CID EMAIL         RN        CNT
------------------ ---------- ----- ---------- ----------
AAAIT4AAJAAAAAVAAB          1 xxxx           2          1
AAAIT4AAJAAAAAVAAC          1 xxxx           3          1
AAAIT4AAJAAAAAVAAD          1 xxxx           4          1
AAAIT4AAJAAAAAVAAE          1 xxxx           5          1

and this does that:

 
ops$tkyte@ORA9IR2> delete from t where rowid in
  2  (
  3  select rid
  4    from (
  5  select rowid rid,
  6         cid, email,
  7         row_number() over (partition by cid order by email) rn,
  8             count(distinct email) over (partition by cid) cnt
  9    from t
 10         )
 11   where cnt = 1 and rn > 1
 12  )
 13  /
 
4 rows deleted.
 
 

Thanks

Huy, June 21, 2005 - 4:13 am UTC

Thanks for a great solution and especially for your time.


duplicate records

mohan, October 18, 2005 - 9:11 am UTC

Hi Tom,

How do I delete the rows using an analytic function?
Duplicate records can be identified using the query below:

select *
  from (select empno, ename, sal, deptno,
               row_number() over (partition by empno, deptno order by deptno) rn
          from emp)
 where rn <> 1
/
Thanks
Mohan

Tom Kyte
October 18, 2005 - 9:23 am UTC

Need more help

Prashant, January 13, 2006 - 5:16 am UTC

Hi Tom,
I have data as like following

Table Temp

ORDER_NUMBER HEADER_ID LINE_ID
99000003 32546 100590
99000003 32546 100590
99000003 32546 100590
99000003 32546 100590

I want to delete the duplicate records in this.

Tom Kyte
January 13, 2006 - 11:13 am UTC

hmm, read this page?

Millions of duplicate records

A reader, February 09, 2006 - 3:18 pm UTC

I have a partitioned table with about 57 million records. It has millions of duplicate records. What is the best way to delete them? I tried bulk collect and insert into a new table; it is taking forever, and how can I see the rejected records?

Tom Kyte
February 10, 2006 - 11:36 am UTC

Is "millions" TENS of millions or 2 million?

If many of the records will be deleted, I would prefer to select out and keep the records of interest


create table new_table
as
select ....
  from (select a.*, row_number() over (partition by KEY_FIELDS order by rowid) rn
          from old_table a)
 where rn = 1;


If few of the records will be deleted, I would remove them like this:


delete from t
 where rowid in
       (select rid
          from (select rowid rid,
                       row_number() over (partition by KEY_FIELDS
                                          order by rowid) rn
                  from t )
         where rn <> 1 );
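
If the "keep the rows of interest" route is taken, the usual finish is to swap the new table in for the old one; a rough sketch with the hypothetical names above (indexes, constraints, grants and triggers have to be recreated on the new table before the swap):

-- after verifying new_table is correct and complete
rename old_table to old_table_with_dups;
rename new_table to old_table;
-- and, once comfortable:  drop table old_table_with_dups;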



Fabulous!

Nafeesah Hines, February 21, 2006 - 9:14 pm UTC

EXACTLY what I was looking for. Kudos Tom!!

Complex SQL Query - Find Duplicate Records.

VIKAS, April 03, 2006 - 4:24 am UTC

Dear Mr Kyte,

I have to write a query (basically a procedure) that selects duplicate records from a table that has more than 15 million records, satisfying the following conditions:

A record is a duplicate if the columns of the table given below have the same/identical values, e.g.:

(either(firstname and phone)
or (firstname and officephone)are same
or (firstname and mobilephone)are same
or (firstname and DOB)are same)
or
(either (phone and officephone)
or (phone and mobilephone) are same
or (phone and DOB)are same)
or
(either(officephone and mobilephone)
or (officephone and DOB)are same)
or
(either(mobilephone or DOB)are same)

Some assumptions & validations for this are:
1) A filter upon the date range (CreatedOn) of creation of the records may be applied.
2) The records should be ordered on the basis of Firstname.
3) Any null values for the above conditions should also be taken into account as duplicates.
4) And yes, there exists a primary key on columns Ownerid and Leadid.

These are some set of conditions that the query would use to get the duplicate records.

After many attempts and fights this is what I came up with, but I am not satisfied with it as it is taking a hell of a lot of time:

Select a.OwnerID,LeadID,Phone,MobilePhone,OfficePhone,FirstName,DOB ,Location,a.ProductID,
a.StatusCodeID,l.StatusCode,Address from ge_dedupe_leads a left Outer Join LeadStatusCodes l on l.OwnerID = 32
AND a.StatusCodeID = l.StatusCodeID
where LeadID in ( select First from ( select a.Leadid First, b.leadid Second,1 as dupcount
from dedupe_recs a,dedupe_recs b where a.location = b.location
and a.leadid <> b.leadid
AND ((a.Firstname = b.Firstname
AND ( a.OfficePhone = b.OfficePhone or a.Phone = b.Phone
or a.MobilePhone = b.MobilePhone or a.DOB = b.DOB)
)
or (a.OfficePhone = b.OfficePhone
AND ( a.Phone = b.Phone or a.MobilePhone = b.MobilePhone or a.DOB = b.DOB)
)
or (a.MobilePhone = b.MobilePhone
AND ( a.DOB = b.DOB or a.Phone = b.Phone )
)
or (a.DOB = b.DOB
AND a.Phone = b.Phone)
)
AND a.CreatedOn BETWEEN
to_date('1/11/2005','DD/MM/YYYY') AND
to_date('31/12/2005','DD/MM/YYYY')
AND b.CreatedOn BETWEEN
to_date('1/11/2005','DD/MM/YYYY') AND
to_date('31/12/2005','DD/MM/YYYY')
)
group by First, Second
)
order by FirstName;

Can you please suggest me a better alternative to this, which satisfies the above conditions and runs faster?



Tom Kyte
April 03, 2006 - 8:14 am UTC

so, what logic will you use when a given record is a duplicate of 15 other records - none of which is a duplicate of "it"

I mean - a single record might be a dup of lots of records - all for different reasons. do you just want to know "I am a duplicate" or what?

Complex SQL Query - Find Duplicate Records.

Vikas, April 03, 2006 - 8:53 am UTC

Dear Mr. Kyte

Thanx a lot for your prompt response (as usual).
As feedback to your response, the above query will be used to identify the duplicate records on the basis of the above predefined conditions and validations, which will then be...

Firstly, moved to another table Dup_Rec_History or a flat file.
Secondly, deleted from the main table.

To achieve this we are planning to:

1) Create a procedure which might be executed only once on the main table & do the honors, and then

2) Use a trigger on insert which will check for a duplicate record each time a record is entered and do the needful.

or

3) Just use the above created procedure and schedule it to run weekly.

But the only problem is the duplicate record search, which takes too much time/memory/IO and temp segments.

Regards...
Vikas.

Delete duplicate rows

Ashok Kumar, May 04, 2006 - 4:12 am UTC

It helps a lot,

Thanks

Deleting duplicate records

Mohammed Khaled, July 25, 2007 - 8:02 am UTC

Hi Tom,

My requirement is to find duplicate sets of column values (pertaining to multiple columns)
based on one column value.

I have simulated my project requirement here in a dummy table. The data in my table is like this:

pmkey prodid prodname
-------------------------
101 1 AA
102 1 AA
103 2 AA
104 2 BB
105 3 BB
106 3 AA
107 4 AA
108 4 BB
109 4 CC

Here I want to return the prodids that have a duplicate set of prodnames; from the above data
the SQL should return prodid 2 and 3 only, because these two prodids have the same set of
prodnames, AA and BB.

Thanks for your help.


Create and insert statements for dummy table

CREATE TABLE DUPDEMO
( PRMKEY NUMBER,
PRODID NUMBER,
PRODNAME VARCHAR2(30 BYTE)
) ;


-- INSERTING into
Insert into dupdemo values (101,1,'AA');
Insert into dupdemo values (102,1,'AA');
Insert into dupdemo values (103,2,'AA');
Insert into dupdemo values (104,2,'BB');
Insert into dupdemo values (105,3,'BB');
Insert into dupdemo values (106,3,'AA');
Insert into dupdemo values (107,4,'AA');
Insert into dupdemo values (108,4,'BB');
Insert into dupdemo values (109,4,'CC');


Thanks and Regards,
Mohammed Khaled
Tom Kyte
July 26, 2007 - 9:19 am UTC

ops$tkyte%ORA10GR2> select prodid
  2    from (
  3  select prodid, count(*) over (partition by scbp) cnt
  4    from (
  5  select prodid, max( sys_connect_by_path( prodname, ',') ) scbp
  6    from (
  7  select d.*, row_number() over (partition by prodid order by prodname) rn
  8    from dupdemo d
  9         )
 10   start with rn = 1
 11   connect by prior rn = rn-1 and prior prodid = prodid
 12   group by prodid
 13         )
 14         )
 15   where cnt > 1
 16  /

    PRODID
----------
         3
         2

very useful

A reader, July 25, 2007 - 2:09 pm UTC


RE: Deleting duplicate records

Frank Zhou, July 26, 2007 - 11:54 am UTC

Here is an alternative SQL solution for the above question.

Frank

SQL> select prodid 
  2  from
  3  (select prodid, count(*) over (partition by str) cnt
  4  from 
  5  ( SELECT prodid,XMLAgg(XMLElement(x,PRODNAME) order by PRODNAME).getstringval() str 
  6    FROM dupdemo group by prodid
  7  )
  8  )
  9  where cnt >1 ;

    PRODID
----------
         2
         3

SQL> 

Deleting duplicate records

Mohammed Khaled, July 27, 2007 - 10:29 am UTC

Tom,

Thank you very much for your quick solution.

Regards,
Khaled

Duplicate check

SK, October 31, 2007 - 7:21 am UTC

Hi,

Please tell me which one is better from a performance point of view.
We have 13 million rows in ma_item_test on the live DB.

DUPLICATE LOGIC - 1

Item_no NOT IN
(SELECT vt_item_no
FROM ma_item_test vm_tran_a
WHERE exists (SELECT 'X'
FROM ma_item_test vm_tran_b
WHERE vm_tran_a.vt_item_no = vm_tran_b.vt_item_no
AND vm_tran_b.rowid > vm_tran_a.rowid))


DUPLICATE LOGIC - 2

Item_no NOT IN
(SELECT vt_item_no,vt_status
FROM ma_item_test
WHERE vt_item_no IN (SELECT vt_item_no
FROM ma_item_test
WHERE vt_status <> 'SUCCESS'
GROUP by vt_item_no
HAVING count(vt_item_no) > 1))
/

Regards,

SK

Tom Kyte
November 01, 2007 - 4:04 pm UTC

to check for duplicates you would just

select c1, c2, c3, ....
  from table
 group by c1, c2, c3, ....
having count(*) >1;


you would not use a not in. I am not following your logic.

duplicate check requirement.

SK, November 02, 2007 - 2:40 am UTC

Hi,

I need (not in) because my requirement is as below.

Poll the ma_item_test
where conditions like
VT_START_DTTM = null and
VT_STATUS = 'NEW' or
VT_STATUS = 'PENDING' and (either no previous records with the same item exist, or previous records with the same item number have a VT_STATUS of 'SUCCESS')

So query will look like this.

CURSOR cvm_trans_cur IS
SELECT *
FROM ma_item_test
WHERE vt_start_dttm IS null
AND (vt_status = 'NEW' OR
vt_status = 'PENDING')
AND vt_item_no NOT IN (SELECT vt_item_no
FROM ma_item_test
WHERE vt_item_no IN (SELECT vt_item_no
FROM ma_item_test
WHERE vt_status <> 'SUCCESS'
GROUP BY vt_item_no
HAVING count(vt_item_no ) > 1))


Please suggest me your view on this query.

Thanks & Regards
SK
Tom Kyte
November 02, 2007 - 1:06 pm UTC

you use the word "previous"

define "previous" here - I see no order by, what do you mean "previous"

Too specific

S, November 02, 2007 - 2:30 pm UTC

I guess Tom was overly specific in his follow-up on Nov 1.

Instead of:
select c1, c2, c3, ....
  from table
 group by c1, c2, c3, ....
having count(*) >1;

more general answer would be:
select c1, c2, c3, ....
  from (<any query here>)
 group by c1, c2, c3, ....
having count(*) >1;

S

Duplicate check

SK, November 05, 2007 - 7:55 am UTC

Hi Tom,

Here is the explanation for "previous".

(either no previous records with the same item exist, or previous records
with the same item number have a VT_STATUS of 'SUCCESS')

This means if there is any record with the same vt_item_no (i.e. a duplicate), don't consider this record for the cvm_trans_cur select cursor.
Or, if there is any record with the same vt_item_no and its vt_status = 'SUCCESS', then consider this record for the cvm_trans_cur select cursor.

Hope this will help to understand query criteria.


I need (not in) because my requirement is as below.

Poll the ma_item_test
where conditions like
VT_START_DTTM = null and
VT_STATUS = 'NEW' or
VT_STATUS = 'PENDING' and (either no previous records with the same item exist, or previous records
with the same item number have a VT_STATUS of 'SUCCESS')

So query will look like this.

CURSOR cvm_trans_cur IS
SELECT *
FROM ma_item_test
WHERE vt_start_dttm IS null
AND (vt_status = 'NEW' OR
vt_status = 'PENDING')
AND vt_item_no NOT IN (SELECT vt_item_no
FROM ma_item_test
WHERE vt_item_no IN (SELECT vt_item_no
FROM ma_item_test
WHERE vt_status <> 'SUCCESS'
GROUP BY vt_item_no
HAVING count(vt_item_no ) > 1))


Please suggest me your view on this query.

Thanks & Regards
SK


Tom Kyte
November 06, 2007 - 9:10 am UTC

just use analytics, you want to know if some record in that group have 'success' (it is not previous, it is not next, it is not positional at all - your terminology is confusing)

I don't get your double nested subquery at all.

(SELECT vt_item_no
FROM ma_item_test
WHERE vt_item_no IN (SELECT vt_item_no
FROM ma_item_test
WHERE vt_status <> 'SUCCESS'
GROUP BY vt_item_no
HAVING count(vt_item_no ) > 1))

isn't that just

(SELECT vt_item_no
FROM ma_item_test
WHERE vt_status <> 'SUCCESS'
GROUP BY vt_item_no
HAVING count(vt_item_no ) > 1)

you don't need to double query like that.

something like this (just assign a count of success records to each record by vt_item_no

select *
  from (
select t.*,
       count( case when vt_status <> 'SUCCESS' then 1 end ) over (partition by vt_item_no) cnt
  from t
       )
 where cnt <= 1
   and vt_status in ('NEW', 'PENDING' )
   and vt_start_dttm is null
/


Duplicate rows

Venkat, March 10, 2008 - 4:54 pm UTC

Tom,

I have a table "Address" which holds Address info.
Name_line_1
Addr_line_1
Addr_line_2
City
State
Zip_code

Ideally one person (column Name_line_1) should have only one address. If a person has multiple addresses, treat them as duplicates. I would like to find those duplicate records.
Example data:

Name_line_1  addr_line_1   addr_line_2  city     state  zip_code
-----------  ------------  -----------  -------  -----  ----------
Harry        100 mount rd  Null         Houston  TX     77082
Harry        100 mount rd  Null         Houston  TX     77082-2728

Below is my query, which retrieves such duplicates.

select distinct name_line_1, addr_line_1, city, state, zipcode
  from address
 where name_line_1 in (
         select name_line_1
           from (select distinct name_line_1, addr_line_1, city, state, zipcode
                   from address)
          group by name_line_1
         having count(*) > 1
       )

Is there any other way to write a simpler query?
Thanks in Advance


Tom Kyte
March 10, 2008 - 9:20 pm UTC

if name_line_1 is supposed to be unique, your query does NOT return all of the dups.

but whatever. (sigh, data integrity, wonder what happened to it... so sad)

select * 
  from (select t.*, count(*) over (partition by name_line_1) cnt
          from your_table t
       )
where cnt > 1;



not exactly the same as yours, but returns the duplicates which is what you said you wanted..

Keeping min or max rowid

Sinardy, March 11, 2008 - 2:33 am UTC

Hi Tom,

In your example you are using max(rowid) instead of min(rowid); does that mean that among any given set of similar records you prefer to keep the newest one? Is there any scientific reason for doing so?

Thank you.
Sinardy
Tom Kyte
March 11, 2008 - 7:15 am UTC

you may use anyone you wish - the goal was just to keep "a record", the max rowid is as good as the min rowid is as good as any other rowid in the set of duplicate records.

delete duplicates

RB, July 02, 2008 - 7:22 pm UTC

Hi Tom,

I have a table with 4 columns: col1, col2, col3, upload_date.
I want to keep the row with the max upload_date and delete the rest of them. Please let me know if there is a better way (SQL) to do this other than a PL/SQL block.

create table t1 (col1 number, col2 number, col3 number,upload_date date);

insert into t1 values(100,101,102,to_date('1-MAY-2008 11:51:49','DD-MON-YYYY HH:MI:SS'));
insert into t1 values(100,101,102,to_date('1-MAY-2008 11:52:25','DD-MON-YYYY HH:MI:SS'));
insert into t1 values(100,101,102,to_date('1-MAY-2008 11:53:19','DD-MON-YYYY HH:MI:SS'));

insert into t1 values(200,201,202,to_date('1-JUN-2008 10:44:49','DD-MON-YYYY HH:MI:SS'));
insert into t1 values(200,201,202,to_date('1-JUN-2008 10:56:23','DD-MON-YYYY HH:MI:SS'));
insert into t1 values(200,201,202,to_date('1-JUN-2008 10:58:12','DD-MON-YYYY HH:MI:SS'));


begin
  for i in
    (select col1, col2, col3, count(*)
       from t1
      group by col1, col2, col3
     having count(*) > 1)
  loop
    delete from t1
     where col1 = i.col1
       and col2 = i.col2
       and col3 = i.col3
       and upload_date <
           (select max(upload_date)
              from t1
             where col1 = i.col1
               and col2 = i.col2
               and col3 = i.col3);
  end loop;
end;

Thanks!
Tom Kyte
July 07, 2008 - 9:26 am UTC

NO procedural CODE!!! please - NO procedural CODE, this is just sql, no procedural code needed


ops$tkyte%ORA11GR1> select t1.*, row_number() over (partition by col1, col2, col3 order by upload_date DESC NULLS LAST) rn
  2    from t1
  3   order by col1, col2, col3, upload_date desc nulls last
  4  /

      COL1       COL2       COL3 UPLOAD_DATE                  RN
---------- ---------- ---------- -------------------- ----------
       100        101        102 01-may-2008 11:53:19          1
       100        101        102 01-may-2008 11:52:25          2
       100        101        102 01-may-2008 11:51:49          3
       200        201        202 01-jun-2008 10:58:12          1
       200        201        202 01-jun-2008 10:56:23          2
       200        201        202 01-jun-2008 10:44:49          3

6 rows selected.

ops$tkyte%ORA11GR1> delete from t1
  2  where rowid in
  3  (select rid
  4     from (select rowid rid, row_number() over (partition by col1, col2, col3 order by upload_date DESC NULLS LAST) rn
  5             from t1)
  6   where rn > 1
  7  )
  8  /

4 rows deleted.

ops$tkyte%ORA11GR1> select * from t1;

      COL1       COL2       COL3 UPLOAD_DATE
---------- ---------- ---------- --------------------
       100        101        102 01-may-2008 11:53:19
       200        201        202 01-jun-2008 10:58:12


asheesh dhupper, August 13, 2008 - 7:19 am UTC

Hi Tom,

I have a table on Oracle 'Oracle8i Enterprise Edition Release 8.1.7.4.0 - Production' which has duplicate values.
I need to delete duplicate records like suppose
1)if there are two rows having null values except PK(empid) and the last_modification_date
2)if in field "name" has two rows with value 'XYZ',the last_mdification_date and PK(Empid) and all values are null
3)if there are two field "name" and "dept" which has value 'XYZ' and '10' .

SNo. empid name dept sal last_modification_date
1 10 31-jul-2008
2 20 01-aug-2008
3 30 XYZ 24-aug-2008
4 40 XYZ 25-aug-2008
5 50 XYZ 20 26-aug-2008
6 60 XYZ 20 27-aug-2008
7 70 XYZ 28-aug-2008
Now in the above scenario I want to delete the rows with SNo. 1, 3, 4, 5.

Please guide.....
Tom Kyte
August 18, 2008 - 9:13 am UTC

your description of this is fairly "not all there"

look at #3 for example (where is '10' in any of the example)


do you really really mean "XYZ" or "same values"

are you really saying:

pretend my key is name, dept, sal - and pretend that NULL does = NULL, I would like to remove duplicates based on that when EMPID and LAST_MODIFICTION_DATE are not null....


if so, you really sort of need a create table, insert into table and to state better what you mean.


start with this:

select *
from (select t.*, rowid rid, row_number() over (partition by name, dept, sal order by rowid) rn from t)
where rn <> 1;

that should probably be the set of rows you want to delete.

if so,

delete from t where rowid in (that query);
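
Spelled out, combining those two statements would look something like this sketch - the partition key (name, dept, sal) is the one proposed above, and whether it is the right key depends on the clarified requirement:

delete from t
 where rowid in ( select rid
                    from (select rowid rid,
                                 row_number() over (partition by name, dept, sal
                                                    order by rowid) rn
                            from t)
                   where rn <> 1 );

Note that PARTITION BY treats NULLs in the key columns as equal, which matches the "pretend NULL = NULL" reading above.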



Asheesh Dhupper, August 18, 2008 - 2:46 am UTC

Hi Tom,

I will be very grateful to u for ur help.
Tom Kyte
August 20, 2008 - 9:29 am UTC

I'll let "U" know the moment I meet that person.

Hopefully, they will know what "UR" is, my knowledge of early Mesopotamia is rusty.

Asheesh Dhupper, August 19, 2008 - 6:31 am UTC

Hi Tom,
Thank u very much for your response, but it didn't help.
Just to clarify again I want to delete all duplicates without considering the empid which is the primary key and the last_modification_date. I wrote a PL/SQL Block too, but no use. Please help.

Deleting Duplicate rows with Nulls

Raj, September 12, 2008 - 4:11 pm UTC

Hi Tom,

I have the situation like this

COL_1 COL_2

101
101 John
102 Smith
102
103 Rob
103 Rob
104
104

The difference between 101 and 102 is that for 101 the first row is null, and for 102 the second row is null.

Expected Result is

COL_1 COL_2

101 John
102 Smith
103 Rob
104

Tom Kyte
September 16, 2008 - 5:16 pm UTC

well, there is no such thing as a "first row"

what if the data was

101     <null>
101     John
101     Bill



what is the expected result then - think of all of the cases. You used the word "duplicate" - I don't see your logic for saying "these two rows are the same"

Right now it looks more like "col_2 can be null or not null by col_1, if it is not null - then col_2 will have the same value for all rows given that col_1"

But I'm not sure - fill the example all of the way out - be much more explicit.

Deleting the Duplicate Values

Raj, September 15, 2008 - 12:12 pm UTC

I forgot to indicate the Environment and Version of Oracle

Environment is UNIX
Version is ORACLE 9i i.e.,

Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production
With the Partitioning option
JServer Release 9.2.0.6.0 - Production

Deleting Duplicate rows

Raj, September 18, 2008 - 11:57 am UTC

Any update on the scenario I specified?

Right now the solution is arrived at in three steps.

1. First delete the duplicates.
2. Next delete the duplicate rows where col_2 has non-null values.
3. Next delete the duplicate rows where one of the col_2 values is null.

I am trying to combine all the steps into a single query. I am not sure whether this is achievable.

Thanks,
Raj.
Tom Kyte
September 18, 2008 - 9:26 pm UTC

Umm, hmmm, er...

did you read my question to you - ball in your court.


I don't know what the difference between #1 and #2 is, and #3 doesn't make sense to me at all.

Marius NITU, March 19, 2010 - 9:44 am UTC

Hi Tom,

---------------------------------------------------
|Accounts|>---------|Customers|---------<|Addreses|
---------------------------------------------------

I'm searching for a solution (Oracle 9i) to find all customers that are duplicates with respect to their names (fname, lname, nickname, etc.), at least one of their accounts, and at least one of their addresses; in this case customer 100 with 300, and 200 with 400.
Thanks,

Create Table Customers(
cu_id number(10) primary key,
cu_fname varchar2(30),
cu_lname varchar2(30),
cu_nickname varchar2(30)
)
/
Create Table Addreses(
ad_id number(10) primary key,
cu_id references Customers(cu_id),
ad_zipcode varchar2(10),
ad_region varchar2(30),
ad_street varchar2(30)
)
/
Create Table Accounts(
ac_id number(10) primary key,
cu_id references Customers(cu_id),
ac_bname varchar2(10),
ac_bloc varchar2(30),
ac_bnumber varchar2(15)
)
/
insert into Customers Values (100, 'FNAME1', 'LNAME1', 'NICK1')
/
insert into Customers Values (200, 'FNAME2', 'LNAME2', 'NICK2')
/
insert into Customers Values (300, 'FNAME1', 'LNAME1', 'NICK1')
/
insert into Customers Values (400, 'FNAME2', 'LNAME2', 'NICK2')
/
insert into Customers Values (500, 'FNAME1', 'LNAME1', 'NICK1')
/
insert into Customers Values (600, 'FNAME4', 'LNAME4', 'NICK4')
/
insert into Addreses Values (1, 100, '1000', 'EAST', 'French Fries')
/
insert into Addreses Values (2, 100, '2000', 'EAST', 'Cheap Road')
/
insert into Addreses Values (3, 300, '1000', 'EAST', 'French Fries')
/
insert into Addreses Values (4, 300, 'A000', 'TEST', 'A road')
/
insert into Addreses Values (5, 200, '2010', 'EAST', '6 avenue')
/
insert into Addreses Values (6, 400, '2010', 'EAST', '6 avenue')
/
insert into Addreses Values (7, 500, '5000', 'EAST', 'Test Road')
/
insert into Addreses Values (8, 600, '5000', 'EAST', 'B Road')
/
insert into Accounts values (10, 100, 'BANK10', 'West street', '123456789')
/
insert into Accounts values (20, 100, 'BANK10', 'East street', '123456788')
/
insert into Accounts values (30, 200, 'BANK20', 'North side', '923456789')
/
insert into Accounts values (40, 300, 'BANK10', 'West street', '123456789')
/
insert into Accounts values (50, 400, 'BANK20', 'North side', '923456789')
/
insert into Accounts values (60, 500, 'BANK30', 'East street', '123456788')
/
insert into Accounts values (70, 600, 'BANK40', 'East street', '123456788')
/

Tom Kyte
March 19, 2010 - 12:01 pm UTC

I think this will do what you need, make sure you understand it before even thinking of using it...

ops$tkyte%ORA10GR2> with
  2  cust
  3  as
  4  (select *
  5     from (select c.*, count(*) over (partition by cu_fname, cu_lname, cu_nickname)cnt
  6             from customers c )
  7   where cnt > 1
  8  ),
  9  data
 10  as
 11  (select c.cu_id, cu_fname, cu_lname, cu_nickname,
 12          ad_zipcode, ad_region, ad_street,
 13                  ac_bname, ac_bloc, ac_bnumber
 14     from cust c, addreses ad, accounts ac
 15    where c.cu_id = ad.cu_id
 16      and c.cu_id = ac.cu_id
 17  )
 18  select distinct d1.cu_id, d2.cu_id
 19    from data d1, data d2
 20   where d1.cu_fname = d2.cu_fname
 21     and d1.cu_lname = d2.cu_lname
 22     and d1.cu_nickname = d2.cu_nickname
 23     and d1.ad_zipcode = d2.ad_zipcode
 24     and d1.ad_region = d2.ad_region
 25     and d1.ad_street = d2.ad_street
 26     and d1.ac_bname = d2.ac_bname
 27     and d1.ac_bloc = d2.ac_bloc
 28     and d1.ac_bnumber = d2.ac_bnumber
 29     and d1.cu_id < d2.cu_id
 30  /

     CU_ID      CU_ID
---------- ----------
       100        300
       200        400


build cust as possible duplicates, they at least have the same names, we know that for sure. Should be "small" I would presume - to help the rest of the stuff.

Now take that small set and add the combinations of addresses and accounts to it.

Take that smallish set and join it to itself by all of the 'keys' (watch out for nulls - you didn't say how you wanted them to be treated, so I did it the SQL way: they are NOT equal to each other). Keep any rows such that the keys match - and the first cu_id is less than the other cu_id.

print out the matching pairs...

How is this SQL better?

Mukund Nampally, March 20, 2010 - 8:30 pm UTC

Hi Tom,
There was a response from one of the users on May 12, 2003 in this thread about deleting duplicate records.
The SQL is:

delete from t1 where
rowid in ( select rowid from t1 group by rowid, col1, col2, coln
minus
select max(rowid) from t1 group by col1, col2,coln);

I'm wondering if rowid in the group by clause is a typo - how can one group by rowid when it is unique?

Moreover, can you explain why the above SQL would be better than:
delete from T1
where rowid <> ( select max(rowid)
from t1 b
where b.c1 = t1.c1
and b.c2 = t1.c2 )

Another way of SQL:
delete from t1 where
rowid in ( select MAX(rowid) from t1 group by col1, col2, coln having count(1) > 1);

Would there be any difference in the plans if I use = in the above SQL? Also, you once asked how many passes the above SQL takes to de-dup this:
table t:

X
-------
1
1
1
1 --assuming this is max rowid

Is that 3 times? What is your take on this one?
Tom Kyte
March 22, 2010 - 8:46 am UTC


I would not say theirs was very efficient.

... ( select rowid from t1 group by rowid, col1, col2, coln
minus
select max(rowid) from t1 group by col1, col2,coln);
....


the group by in the first part was just wasteful, it would be the same as:

select rowid from t1
minus
select max(rowid) from t1 group by c1, c2, cn


and I would prefer
select rid
  from (select rowid rid, row_number() over (partition by c1, c2, ... cn ) rn
          from t)
where rn > 1



that would, in a single pass, get the set of rowids the other MINUS query gets in two passes with aggregation.



My current favorite would be

delete from t where rowid IN ( that query I just wrote );
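
Spelled out against the T1 (C1, C2) case quoted above, that favorite becomes (a sketch - it keeps an arbitrary one of the duplicates per group):

delete from t1
 where rowid in ( select rid
                    from (select rowid rid,
                                 row_number() over (partition by c1, c2 order by rowid) rn
                            from t1)
                   where rn > 1 );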



Marius NITU, March 22, 2010 - 9:39 am UTC

Thanks Tom, it is exactly what I was searching for. In reality things are a little bit more complicated: the query is built dynamically, based on some user choices; sometimes we will consider nulls as equal, sometimes as different; sometimes we will hit only 1 or 2 of the three tables, etc.

Thanks again
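
For the "nulls considered equal" variant, one well-known Oracle trick is DECODE, which treats two NULLs as a match - a sketch for a single join column only:

   and decode(d1.ad_zipcode, d2.ad_zipcode, 1, 0) = 1  -- nulls compare as equal

in place of the plain equality d1.ad_zipcode = d2.ad_zipcode; the same pattern applies to any other column where NULL should compare equal to NULL.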

duplicate rows

prithviraj, April 01, 2010 - 5:12 am UTC

Hi Tom,
I have 23480 rows with 65 columns in a table.
Ideally there should be 11234 records in the table, but most records are duplicated - some are triplicated (3 rows), some have 4 rows.
Please tell me how I can delete the duplicated records keeping only one record (11234 records should remain in the table),
e.g.
id name
1 aaa
1 aaa
2 bbb
2 bbb
3 ccc
3 ccc
3 ccc
---------
result should be
id name
1 aaa
2 bbb
3 ccc

Thanks,
Prithviraj India
Tom Kyte
April 05, 2010 - 12:58 pm UTC

start perhaps by reading this page???

I mean seriously. Here you are on a page named "Deleting duplicate records", with a conversation that goes back many years and quite a few examples of how to do this - examples that improve over time and get better as new features were added to SQL.

and you cannot figure out how to delete duplicates from your table?
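
For the two-column example shown (and assuming the table is named T, which the post never says), the MAX(ROWID) form discussed above reduces to:

-- sketch: T is an assumed table name
delete from t
 where rowid not in ( select max(rowid)
                        from t
                       group by id, name );

or the ROW_NUMBER variant shown several times on this page, partitioned by id, name.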

DELETE DUPLICATED ROWS WITH SPECIFIC FIELDS

Momcilo, September 21, 2010 - 7:32 am UTC

Hi Tom,

We have a table with following records:

ID StartTime EndTime Duration OrigNum DestNum ...
111 22:00:01 22:20:23 00:20:22 567890 345678 ...
112 11:00:22 11:00:44 00:00:22 334222 345556 ...
113 22:00:01 22:20:23 00:20:22 567890 345678 ...
... ...

The table receives more than 5 million records daily, but we have 2 or 3 duplicates for many records and want to delete all but one for each call.
All the records have different IDs, but we have to delete all but one of the records with the same StartTime, EndTime, Duration, OrigNum and DestNum.

Please help as soon as possible.

Thanks in advance!
Tom Kyte
September 21, 2010 - 4:07 pm UTC

this entire page is about doing just that????


delete from t
 where rowid in ( select rid
                    from (select rowid rid,
                                 row_number() over (partition by starttime, endtime, duration,
                                                                 orignum, destnum
                                                        order by starttime) rn
                            from t)
                   where rn >= 2 );

and after you do that PUT A UNIQUE CONSTRAINT on the columns!!!!! use dml error logging if you want to allow future duplicates to be loaded and silently "unloaded" at the same time.
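
A sketch of that combination (the staging table name is made up; the error-log table is created once with DBMS_ERRLOG):

alter table t add constraint t_call_uk
  unique (starttime, endtime, duration, orignum, destnum);

exec dbms_errlog.create_error_log( 'T' )

-- stage_calls is a hypothetical source of the daily feed
insert into t ( starttime, endtime, duration, orignum, destnum )
select starttime, endtime, duration, orignum, destnum
  from stage_calls
   log errors into err$_t ('daily load') reject limit unlimited;

Rows that would create duplicates violate the unique constraint, land in ERR$_T instead of T, and the rest of the load carries on.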

Sunny, October 07, 2010 - 3:07 pm UTC

Hi Tom,

I have a table T1 with columns C1, C2, C3 & C4 having some duplicates with the combination C1 & C2.

Row C1 C2 C3 C4
--- -- -- -- --
1 1 a u 20
2 2 b v 30
3 2 b w 15
4 2 b x 12
5 3 c y 40
6 3 c z 10

I would like to delete the duplicates but first add the C4 values of the rows to be deleted to the C4 value of the row left undeleted. Also, the duplicates to be deleted should be the rows having higher C3 values. The result should be as follows:


Row C1 C2 C3 C4
--- -- -- -- --
1 1 a u 20
2 2 b v 57
5 3 c y 50

Many thanks in advance.

Tom Kyte
October 11, 2010 - 11:24 am UTC

no creates
no inserts
no merge statement for you

Sunny, October 08, 2010 - 7:17 am UTC

Hi Tom,

I had to re-format the tables so they could be well arranged.
I have a table T1 with columns C1, C2, C3 & C4 having some duplicates with the combination C1 & C2.

Row C1 C2 C3 C4
--- -- -- -- --
1 1 a u 20
2 2 b v 30
3 2 b w 15
4 2 b x 12
5 3 c y 40
6 3 c z 10

I would like to delete the duplicates but first add the C4 values of the rows to be deleted to the C4 value of the row left undeleted. Also, the duplicates to be deleted should be the rows having higher C3 values. In other words, the row with the least C3 value should be retained. For instance, Rows 2, 3 & 4 are regarded as duplicates in this case; Row 2 should be retained since it has the least C3 value (v), and the sum of the C4 values of Rows 3 & 4 (15 and 12 respectively) should be added to that of Row 2 (30). The result should be as follows:


Row C1 C2 C3 C4
--- -- -- -- --
1 1 a u 20
2 2 b v 57
5 3 c y 50

How do I use the delete-duplicates query to achieve this?
Many thanks in advance.

Tom Kyte
October 11, 2010 - 11:35 am UTC

no creates
no inserts
.....

Duplicate Rows

Sunny, October 12, 2010 - 5:18 am UTC

Hi Tom,

Here are the create and insert statements; sorry for not including them initially.


SQL> create table T1 (c1 char(1), c2 char(1), c3 char(1), c4 int);

Table created.

SQL> insert into T1 values ( '1', 'a', 'u', 20 );

1 row created.

SQL> insert into T1 values ( '2', 'b', 'v', 30 );

1 row created.

SQL> insert into T1 values ( '2', 'b', 'w', 15 );

1 row created.

SQL> insert into T1 values ( '2', 'b', 'x', 12 );

1 row created.

SQL> insert into T1 values ( '3', 'c', 'y', 40 );

1 row created.

SQL> insert into T1 values ( '3', 'c', 'z', 10 );

1 row created.

SQL>


My question again is:
Some rows are regarded as duplicates with the combination C1 & C2.
I would like to delete the duplicates but first add the C4 values of the rows to be deleted to the C4 value of the row left undeleted. Also, the duplicates to be deleted should be the rows having higher C3 values. In other words, the row with the least C3 value should be retained. For instance,
Rows 2, 3 & 4 are regarded as duplicates in this case; Row 2 should be retained since it has the least C3 value (v), and the sum of the C4 values of Rows 3 & 4 (15 and 12 respectively) should be added to the C4 value of Row 2 (30) before the deletion. The result should be as follows:


Row      C1      C2      C3       C4
---      --      --      --       --
  1       1       a       u       20
  2       2       b       v       57
  5       3       c       y       50


Any help would be highly appreciated. I look forward to quick response.

Tom Kyte
October 12, 2010 - 8:00 am UTC

... I look forward to quick response. ...

not pushy or anything...


ops$tkyte%ORA10GR2> merge into t1
  2  using (select rowid rid, c1, c2, c3, c4,
  3                sum(c4) over (partition by c1, c2) sum_c4,
  4                            row_number() over (partition by c1, c2 order by c3) rn
  5                   from t1 ) new_t1
  6  on (t1.rowid = new_t1.rid)
  7  when matched then update set c4 = new_t1.sum_c4 delete where new_t1.rn > 1
  8  /

6 rows merged.

ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> select * from t1;

C C C         C4
- - - ----------
1 a u         20
2 b v         57
3 c y         50



that does an in place update. I would say that:

ops$tkyte%ORA10GR2> create table new_t1
  2  as
  3  select c1, c2, min(c3) c3, sum(c4) c4
  4    from t1
  5   group by c1, c2
  6  /

Table created.

ops$tkyte%ORA10GR2> drop table t1;

Table dropped.

ops$tkyte%ORA10GR2> rename new_t1 to t1;

Table renamed.

ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> select * from t1;

C C C         C4
- - - ----------
1 a u         20
2 b v         57
3 c y         50


would likely be more efficient time wise.
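
One caveat with the CTAS/drop/rename route (not shown in the transcript): indexes, grants, triggers and constraints other than NOT NULL do not follow the data into NEW_T1, so they have to be re-created on the renamed table - for example:

-- sketch: the constraint name is made up
alter table t1 add constraint t1_uk unique (c1, c2);

which also conveniently stops the duplicates from coming back.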

Duplicate Rows

Sunny, October 12, 2010 - 2:49 pm UTC

Works great! Thanks so much Tom, I really appreciate all your efforts in helping out and I've learnt a lot from this page. You are indeed a genius.

Deleting repeating sets of data

Mithun, September 28, 2011 - 9:59 am UTC

Hi Tom,

As this thread is related to removing duplicates, I am posting my query here.
My requirement is as follows.

CREATE TABLE parent
(
key1 VARCHAR2(4 BYTE) NOT NULL,
key2 VARCHAR2(4 BYTE) NOT NULL,
key3 VARCHAR2(4 BYTE) NOT NULL,
key4 VARCHAR2(10 BYTE) NOT NULL,
key5 VARCHAR2(4 BYTE) NOT NULL,
key6 VARCHAR2(4 BYTE) NOT NULL,
data1 VARCHAR2(5 BYTE),
data2 VARCHAR2(5 BYTE)
);


ALTER TABLE parent ADD (
PRIMARY KEY
(key1,key2,key3,key4,key5,key6));


CREATE TABLE child
(
key1 VARCHAR2(4 BYTE) NOT NULL,
key2 VARCHAR2(4 BYTE) NOT NULL,
key3 VARCHAR2(4 BYTE) NOT NULL,
key4 VARCHAR2(10 BYTE) NOT NULL,
key5 VARCHAR2(4 BYTE) NOT NULL,
key6 VARCHAR2(4 BYTE) NOT NULL,
seq VARCHAR2(4 BYTE) NOT NULL,
ur VARCHAR2(5 BYTE),
lr VARCHAR2(5 BYTE)
);


ALTER TABLE child ADD (
PRIMARY KEY
(key1,key2,key3,key4,key5,key6,seq));


insert into parent(key1,key2,key3,key4,key5,key6,data1,data2) values('k1','k2','k3','k4','k5','k61','123','456');

insert into parent(key1,key2,key3,key4,key5,key6,data1,data2) values('k1','k2','k3','k4','k5','k62','23','46');

insert into parent(key1,key2,key3,key4,key5,key6,data1,data2) values('k1','k2','k3','k4','k5','k63','12','56');

insert into parent(key1,key2,key3,key4,key5,key6,data1,data2) values('k1','k2','k3','k4','k5','k64','13','45');



insert into child(key1,key2,key3,key4,key5,key6,seq,ur,lr) values('k1','k2','k3','k4','k5','k61','1','3F','4L');
insert into child(key1,key2,key3,key4,key5,key6,seq,ur,lr) values('k1','k2','k3','k4','k5','k61','2','4F','5L');
insert into child(key1,key2,key3,key4,key5,key6,seq,ur,lr) values('k1','k2','k3','k4','k5','k62','1','AF','BL');
insert into child(key1,key2,key3,key4,key5,key6,seq,ur,lr) values('k1','k2','k3','k4','k5','k62','2','BF','CL');
insert into child(key1,key2,key3,key4,key5,key6,seq,ur,lr) values('k1','k2','k3','k4','k5','k63','1','3F','4L');
insert into child(key1,key2,key3,key4,key5,key6,seq,ur,lr) values('k1','k2','k3','k4','k5','k63','2','4F','5L');
insert into child(key1,key2,key3,key4,key5,key6,seq,ur,lr) values('k1','k2','k3','k4','k5','k64','1','AF','BL');
insert into child(key1,key2,key3,key4,key5,key6,seq,ur,lr) values('k1','k2','k3','k4','k5','k64','2','BF','CL');

In the present scenario I have two child records for each parent record. In the child table I have two duplicate sets of data, i.e. for the parents 'k1','k2','k3','k4','k5','k61' and 'k1','k2','k3','k4','k5','k63' I have the same set of child data in the same order. I have to delete such duplicate records and the corresponding parent also.

That is, I want only one parent, i.e. the first parent 'k1','k2','k3','k4','k5','k61', in the parent table and the corresponding two child rows in the child table (the repeating pair needs to be deleted).

How to go about this?




Tom Kyte
September 28, 2011 - 12:24 pm UTC

Well, if we break your data up by

key1, key2, key3, key4, key5, seq, ur, lr (skipped key6 on purpose - that is not part of your key for de-dupping), then order each partition by key6 and assign row numbers, we get:

ops$tkyte%ORA11GR2> select p.rowid rid, p.*, c.seq, c.ur, c.lr,
  2         row_number() over (partition by p.key1, p.key2, p.key3, p.key4, p.key5, c.seq, c.ur, c.lr
  3                                order by p.key6 ) rn
  4    from parent p, child c
  5   where p.key1 = c.key1
  6     and p.key2 = c.key2
  7     and p.key3 = c.key3
  8     and p.key4 = c.key4
  9     and p.key5 = c.key5
 10     and p.key6 = c.key6
 11   order by p.key1, p.key2, p.key3, p.key4, p.key5, c.seq, c.ur, c.lr
 12  /

RID                KEY1 KEY2 KEY3 KEY4       KEY5 KEY6 DATA1 DATA2 SEQ  UR    LR            RN
------------------ ---- ---- ---- ---------- ---- ---- ----- ----- ---- ----- ----- ----------
AAAX3KAAEAAAJW8AAA k1   k2   k3   k4         k5   k61  123   456   1    3F    4L             1
AAAX3KAAEAAAJW8AAC k1   k2   k3   k4         k5   k63  12    56    1    3F    4L             2
AAAX3KAAEAAAJW8AAB k1   k2   k3   k4         k5   k62  23    46    1    AF    BL             1
AAAX3KAAEAAAJW8AAD k1   k2   k3   k4         k5   k64  13    45    1    AF    BL             2
AAAX3KAAEAAAJW8AAA k1   k2   k3   k4         k5   k61  123   456   2    4F    5L             1
AAAX3KAAEAAAJW8AAC k1   k2   k3   k4         k5   k63  12    56    2    4F    5L             2
AAAX3KAAEAAAJW8AAB k1   k2   k3   k4         k5   k62  23    46    2    BF    CL             1
AAAX3KAAEAAAJW8AAD k1   k2   k3   k4         k5   k64  13    45    2    BF    CL             2

8 rows selected.


Now, all of the rows with 2 or more assigned to them represent "bad" rows. I'll add your missing foreign key to make sure we do the right thing as far as parent/child goes:
ops$tkyte%ORA11GR2> alter table child add foreign key(key1,key2,key3,key4,key5,key6) references parent;

Table altered.


and this will be one of those rare moments we want to use a global temporary table:

ops$tkyte%ORA11GR2>
ops$tkyte%ORA11GR2> create global temporary table gtt ( rid rowid ) on commit delete rows;
                              
Table created.


We'll pop the rowids of the rows we want to delete from the parent into this table:


ops$tkyte%ORA11GR2>
ops$tkyte%ORA11GR2> insert into gtt
  2  select distinct rid
  3    from (
  4  select p.rowid rid,
  5         row_number() over (partition by p.key1, p.key2, p.key3, p.key4, p.key5, c.seq, c.ur, c.lr
  6                                order by p.key6 ) rn
  7    from parent p, child c
  8   where p.key1 = c.key1
  9     and p.key2 = c.key2
 10     and p.key3 = c.key3
 11     and p.key4 = c.key4
 12     and p.key5 = c.key5
 13     and p.key6 = c.key6
 14         )
 15   where rn >= 2
 16  /

2 rows created.


and then it is an easy matter to remove them:

ops$tkyte%ORA11GR2>
ops$tkyte%ORA11GR2> delete from child
  2  where (key1,key2,key3,key4,key5,key6)
  3  in (select key1,key2,key3,key4,key5,key6 from parent where rowid in (select * from gtt));

4 rows deleted.

ops$tkyte%ORA11GR2>
ops$tkyte%ORA11GR2> delete from parent where rowid in (select * from gtt);

2 rows deleted.

ops$tkyte%ORA11GR2>
ops$tkyte%ORA11GR2> select * from parent;

KEY1 KEY2 KEY3 KEY4       KEY5 KEY6 DATA1 DATA2
---- ---- ---- ---------- ---- ---- ----- -----
k1   k2   k3   k4         k5   k61  123   456
k1   k2   k3   k4         k5   k62  23    46

ops$tkyte%ORA11GR2> select * from child;

KEY1 KEY2 KEY3 KEY4       KEY5 KEY6 SEQ  UR    LR
---- ---- ---- ---------- ---- ---- ---- ----- -----
k1   k2   k3   k4         k5   k61  1    3F    4L
k1   k2   k3   k4         k5   k61  2    4F    5L
k1   k2   k3   k4         k5   k62  1    AF    BL
k1   k2   k3   k4         k5   k62  2    BF    CL


Mithun, September 29, 2011 - 3:14 am UTC

Perfect, Tom. It works for the example data I gave. Maybe I was not clear about the data I provided - sorry about that.

But my exact scenario is that I have to consider the entire set of data, meaning...

Suppose I insert one more row for the key k61:

SQL> insert into child(key1,key2,key3,key4,key5,key6,seq,ur,lr) 
  2  values('k1','k2','k3','k4','k5','k61','3','7F','8L');

1 row created.

SQL> commit;

Commit complete.

SQL> select * from child where key6='k61'
  2  union
  3  select * from child where key6='k63'
  4  ;

KEY1 KEY2 KEY3 KEY4       KEY5 KEY6 SEQ  UR    LR
---- ---- ---- ---------- ---- ---- ---- ----- -----
k1   k2   k3   k4         k5   k61  1    3F    4L
k1   k2   k3   k4         k5   k61  2    4F    5L
k1   k2   k3   k4         k5   k61  3    7F    8L
k1   k2   k3   k4         k5   k63  1    3F    4L
k1   k2   k3   k4         k5   k63  2    4F    5L

I want to compare k61 (seq 1 to 3) as one set and k63 (seq 1 to 2) as one set. So they are not equal, even though the first two rows in each set match.

In this case I should not delete the child records. I should only delete when the entire set is exactly the same, as in the scenario below:

SQL> select * from child where key6='k62'
  2  union
  3  select * from child where key6='k64';

KEY1 KEY2 KEY3 KEY4       KEY5 KEY6 SEQ  UR    LR
---- ---- ---- ---------- ---- ---- ---- ----- -----
k1   k2   k3   k4         k5   k62  1    AF    BL
k1   k2   k3   k4         k5   k62  2    BF    CL
k1   k2   k3   k4         k5   k64  1    AF    BL
k1   k2   k3   k4         k5   k64  2    BF    CL

As the set k62 (seq 1 to 2) is exactly the same as k64 (seq 1 to 2), in this case I can delete the repeating set and the corresponding parent.

1) So how do I achieve this?
2) I have to run through this entire process frequently - should I package the whole process into a procedure and call that, or is there a better way?

Thanks for your time and patience.

Tom Kyte
September 30, 2011 - 12:01 pm UTC

please

a) test this thoroughly
b) make sure you understand the concept behind it

I did this sort of on the fly.

we'll join p to c and get a child record count associated with each record (that is joined data)

then we'll join joined_data to itself by key1..key5, cnt, seq, ur, lr and only keep the join pairs where key6 of the first joined data is less than key6 of the other.

We'll count again how many hits we got by key1..key6.

Of the joined_data join - we'll only keep rows whereby the same number of child records result after the second join.

These are the records you want to get rid of. So, in a verbose query:

ops$tkyte%ORA11GR2> with joined_data as
  2  (
  3  select p.rowid rid, p.key1, p.key2, p.key3, p.key4, p.key5, p.key6, c.seq, c.ur, c.lr,
  4         count(*) over (partition by p.key1, p.key2, p.key3, p.key4, p.key5, p.key6, p.key6 ) cnt
  5    from parent p, child c
  6   where p.key1 = c.key1 and p.key2 = c.key2 and p.key3 = c.key3 and p.key4 = c.key4 and p.key5 = c.key5 and p.key6 = c.key6
  7  )
  8  select *
  9    from (
 10  select jd1.key1, jd1.key2, jd1.key3, jd1.key4, jd1.key5, jd1.key6, jd2.key6 other_key6, jd1.cnt,
 11         count(*) over (partition by jd1.key1, jd1.key2, jd1.key3, jd1.key4, jd1.key5, jd1.key6, jd1.key6 ) cnt2,
 12             jd2.rid
 13    from joined_data jd1, joined_data jd2
 14   where jd1.key1 = jd2.key1
 15     and jd1.key2 = jd2.key2
 16     and jd1.key3 = jd2.key3
 17     and jd1.key4 = jd2.key4
 18     and jd1.key5 = jd2.key5
 19     and jd1.key6 < jd2.key6
 20     and jd1.cnt = jd2.cnt
 21     and jd1.seq = jd2.seq
 22     and jd1.ur = jd2.ur
 23     and jd1.lr = jd2.lr
 24         )
 25   where cnt = cnt2
 26  /

KEY1 KEY2 KEY3 KEY4       KEY5 KEY6 OTHE        CNT       CNT2 RID
---- ---- ---- ---------- ---- ---- ---- ---------- ---------- ------------------
k1   k2   k3   k4         k5   k62  k64           2          2 AAAX7GAAEAAAjWVAAD
k1   k2   k3   k4         k5   k62  k64           2          2 AAAX7GAAEAAAjWVAAD



You'll insert those distinct rowids into the global temporary table and proceed as above.


Instead of running through this process frequently, why don't you prevent it from happening in the first place???

Removing repeating sets of data

A reader, September 29, 2011 - 9:26 am UTC

Hi Tom,

Can you please take some time to answer my requirement above?

Thanks
Tom Kyte
September 30, 2011 - 5:51 pm UTC

can you please understand that I have a day job and cannot answer things immediately?

Yesterday I had to drive 5 hours round trip to a conference and speak for a few hours. I didn't respond then because I find it hard to drive and type at the same time.

Duplicate records removal

A reader, October 01, 2011 - 1:37 am UTC

Sorry, Tom, for being in a hurry to get my question answered.
I should have understood that you have your own work to do. I will not repeat it again. I really appreciate your patience.

Removing Duplicate sets

A reader, October 01, 2011 - 8:36 am UTC

Hi Tom,

1. You described this as a 'verbose query'. Do you mean that it is a tedious way of doing what I want? Should I think along the lines of using PL/SQL to achieve my required result? Please don't get me wrong - I just want to know whether my understanding is correct.

2. I mentioned that I need to repeat the same process every month, i.e. to remove duplicate sets of data; you suggested preventing it up front instead of doing it again and again. But my problem is that we get data from different sources and it goes through many transformations before it reaches us, so by the time the data reaches us there is a chance of duplicate sets in it, and we get new source data every month. That is why I asked how to package this entire process (i.e. whether to create a single procedure, or a package, and call that procedure or package every month on the set of data I get).

Thanks Tom
Tom Kyte
October 01, 2011 - 3:05 pm UTC

1) The verbose query was there to demonstrate the concept - it verbosely returns a ton of data we ultimately do not care about. All we need are the rowids.


It has nothing to do with the size of the query - but rather the size of the results...

2) create a procedure that populates the global temporary table and issues the delete against the child followed by the delete against the parent.

I would have the procedure in a package along with other related procedures...
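
A sketch of such a procedure (the name is made up; it simply strings together the GTT load and the two deletes shown earlier, so test it just as thoroughly):

-- sketch only: remove_duplicate_sets is a hypothetical name
create or replace procedure remove_duplicate_sets
as
begin
    -- identify parents whose entire child set repeats an earlier parent's set
    insert into gtt ( rid )
    with joined_data as
    (
      select p.rowid rid,
             p.key1, p.key2, p.key3, p.key4, p.key5, p.key6,
             c.seq, c.ur, c.lr,
             count(*) over (partition by p.key1, p.key2, p.key3,
                                         p.key4, p.key5, p.key6) cnt
        from parent p, child c
       where p.key1 = c.key1 and p.key2 = c.key2 and p.key3 = c.key3
         and p.key4 = c.key4 and p.key5 = c.key5 and p.key6 = c.key6
    )
    select distinct rid
      from (select jd2.rid rid, jd1.cnt cnt,
                   count(*) over (partition by jd1.key1, jd1.key2, jd1.key3,
                                               jd1.key4, jd1.key5, jd1.key6) cnt2
              from joined_data jd1, joined_data jd2
             where jd1.key1 = jd2.key1 and jd1.key2 = jd2.key2
               and jd1.key3 = jd2.key3 and jd1.key4 = jd2.key4
               and jd1.key5 = jd2.key5
               and jd1.key6 < jd2.key6
               and jd1.cnt  = jd2.cnt
               and jd1.seq  = jd2.seq
               and jd1.ur   = jd2.ur
               and jd1.lr   = jd2.lr)
     where cnt = cnt2;

    -- children of the duplicate parents first (foreign key), then the parents
    delete from child
     where (key1, key2, key3, key4, key5, key6) in
           (select key1, key2, key3, key4, key5, key6
              from parent
             where rowid in (select rid from gtt));

    delete from parent
     where rowid in (select rid from gtt);
end;
/

The caller decides when to commit (which also empties the ON COMMIT DELETE ROWS temporary table).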

Removing Repeating sets of data

A reader, October 02, 2011 - 1:25 am UTC

Thanks a lot, Tom. I am very grateful for the time you spend answering the requirements of many people like me. I greatly appreciate the way you write SQL for any complex requirement and the suggestions you give. Thank you so much.

I just tried UNION for dupes, so adding that to the discussion

rk, October 21, 2011 - 4:33 am UTC

sqlldr scott control=ldr_tst.ctl

cat ldr_tst.ctl
load data
 infile 'EMPLOYEES.txt'
APPEND 
into table EMPLOYEES_DUP
 fields terminated by "," optionally enclosed by "'"             
(EMPLOYEE_ID,FIRST_NAME,LAST_NAME,EMAIL,PHONE_NUMBER,HIRE_DATE,JOB_ID,SALARY,COMMISSION_PCT,MANAGER_ID,DEPARTMENT_ID)

I loaded the data twice into an employees_dup table, then:
create table employees as select * from employees_dup;

Now we have 2 identical tables, each containing the same records twice:
40 rows in each table; the non-duplicated count is 20 rows.


SQL> CREATE TABLE EMPLOYEES_TEMP AS (SELECT * FROM EMPLOYEES UNION SELECT * FROM EMPLOYEES_DUP);

Table created.

SQL> SELECT COUNT(*) FROM EMPLOYEES_TEMP;

  COUNT(*)
----------
        20

SQL> SELECT DISTINCT EMPLOYEE_ID FROM EMPLOYEES_TEMP ORDER BY EMPLOYEE_ID;

EMPLOYEE_ID
-----------
        100
        101
        102
        103
        104
        107
        124
        141
        142
        143
        144

EMPLOYEE_ID
-----------
        149
        174
        176
        178
        200
        201
        202
        205
        206

20 rows selected.

That gives a table with no duplicate values.


Tom Kyte
October 21, 2011 - 8:18 pm UTC


that de-duplicates based on the entire row - DISTINCT on a single query would have been more clear and much more efficient (half the data - less in your case)

The original question wanted to de-duplicate based on c1/c2 - keeping some value of c3 - not based on the entire row.
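
In other words, for a whole-row de-dup like the one above, a single statement over one copy of the data would have done it (same table names as in the example):

create table employees_temp as
select distinct *
  from employees_dup;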

Using Analytical Functions to Reduce Column Data Duplication

Bertram, January 20, 2012 - 1:16 pm UTC

Hi All,
Thanks all for posting your various problems.

The following shows how analytic functions can be used to suppress repeated column values in query output, as an alternative to the BREAK ON presentation formatting:

SELECT
   DECODE( RANK() OVER (ORDER BY C1),
           ROW_NUMBER() OVER (ORDER BY C1),
           C1, NULL ) AS COL_NAME,
   C2,
   SUM(C3)
  FROM T
 GROUP BY ROLLUP (C1, C2);

Thanks to the Community for the continued support,
(BLL)