Thanks for the question, Prakash.

Asked: June 19, 2002 - 7:33 am UTC

Last updated: April 16, 2014 - 4:18 pm UTC

Version: 8.1.7

Viewed 50K+ times! This question is

You Asked

Hi Tom,

If there is any duplications of records in a table, we know how to eliminate the duplicate rows using rowid.
But is there any possibility to delete the duplicate records in a table without using rowid and rownum.
my friend who is working in Oracle for arnd 4 years says that it is not possible. But i think there should be some way to do this.

Pls give me some suggestion TOM. i would be thankful to you, if you can illustrate with examples.

Thanks in Advance

Prakash

and Tom said...

in a single statement -- no, relational algebra doesn't permit it. Entirely duplicate records (where all columns are the same) make this not possible.

using rownum -- no, you cannot use rownum either.

Using rowid, yes, you can use rowid.

Using procedural code, yes, you can use procedural code.

Using more then one statement, yes, I can do this in 3 statements:

insert dups into temp table
delete dups from orig table
put de-duped data back into orig table from temp table

Rating

(41 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

ummmm

Mal, June 19, 2002 - 10:59 am UTC

Hi Tom.

Love this web site, always learning from it.

I just want to clarify the logic to do this. Please correct me if I'm wrong.

1) Put duplicates in a temp table.

Create Table silly_dups as
Select *
From evil_table
Group By col1, col2, col3....
Having count(*) > 1;

2) Remove duplicates from table.
Delete From evil_table
where (col1, col2, col3...) in
(Select col1, col2, col3
From silly_dups);

3) Put duplicate rows back.

Insert into evil_table (col1, col2, col3...)
Select col1, col2, col3...
From silly_dups;

Is that what you meant?

June 19, 2002 - 9:45 pm UTC

3) select DISTINCT ....

but other then that, yes. It is more complex if the columns you consider "DUPS" are less then the number of columns in the table (eg: you have 5 columns -- a row is a dup if the first 3 are the same) -- but takes the same number of statements.

Why wouldn't you use ROWID

Trevor, June 19, 2002 - 6:10 pm UTC

Prakash,
Just wondering why you don't want to use ROWID?
Every row has a unique one, it's the fastest way to access
a row, you can remove the dups in a single SQL statement
as opposed to 3 or writing lines of PL/SQL

Regards
Trevor

I agree with Trevor

Ib, June 19, 2002 - 9:18 pm UTC

Even i am just wondering why the poster doesnt want to use RowId instead of complicating the work!!!

Ib.

June 19, 2002 - 9:51 pm UTC

Many times, I get asked questions of a purely academic nature. This is one of them -- they just "want to know" for whatever reason.

Not only academic question, resources plays role as well

Jaromir Nemec, June 21, 2002 - 4:55 pm UTC

In my opinion, if the table is small you may play with chasing of dubs as much as you will (e.g. with rowid); but with big tables (that is relative to your box and disc throughput) you simple MUST avoid the equi-join on the dubs columns and you must get them out with one simple sort of the table. From this point of view it is not an academic question bus one of the resources.

Jaromir D.B. Nemec

www.db-nemec.com

June 21, 2002 - 6:44 pm UTC

depends on the number of dups doesn't it (rhetorical question, answer = yes).

If the columns (or some subset thereof) upon which you decide are the columns to decide dups (i assume you meant dups, not dubs) are indexed, AND there are some small number of dups to consider, using the index will be just as good as using the rowid (it be cached after all -- you are proccessing those rows).

I was always of the opinion that its always OK to be academic (to wonder, could I). Never said to do this this way....

Sorry for being confusing, .....

Jaromir Nemec, June 23, 2002 - 4:25 pm UTC

Sorry for being confusing, of course I meant dups not dubs, sorry! Dubs could be German, but I am not sure!

Of course the task of removing and reinserting of known dups depends on the number of dups (this is the task 2) and 3) in the above response). The index on dup columns is very usefull here.

But the "hard work" is to find the dups (task 1) above - "put the duplicates in a temp table"). And this work depends not on the result (i.e. number of dups) but on the size of the searched table. It is obviously less complicated to find 10 dups in a table with 1 million records than to find the same 10 dups in a table with 100 million records.

My statement was: if the table is big enough, than there is a difference if the dup search (task 1) above) is done by
a) Equi-join on the searched table (= more resource consuming)
b) Sort of the searched table (= less resource consuming).

Interesting question is, if it is possible to use index (index columns are the dup columns) for this task of searching dups in a table. I have no idea, can you provide any hint?

Thanks

Jaromir Nemec

June 23, 2002 - 6:18 pm UTC

Agreed, using analytic functions we can do this efficiently. Most other methods require some sort of correlated subquery which is like the equi join you talk of.

Like this:

delete from t
where rowid in ( select rid
from ( select rowid rid,
row_number() over
( partition by x order by rowid ) rn
from t
)
where rn > 1 )
/

that'll remove "dups" based on the column X (just partition by the "key to de-dup by")

If you have an index on that column (and at least one of the columns in the index is NOT NULLABLE), then it'll use that index in the plan...

Another reason for de-duping without rowid

Johan Snyman, June 24, 2002 - 3:20 am UTC

Another reason for wanting to remove duplicates without using rowid's may be to have a more "pure SQL" approach that would work in non-Oracle databases as well, where rowid does not exist or an equivalent unique row identifier is not exposed to the user...

One would presumably only need to do this to clean up the data before migrating to Oracle, tho ;-)

What is we have more than 1 duplicate?

Pichai Bala, November 07, 2003 - 6:10 pm UTC

Hi Tom,
I think the problem with the analytic function is it can't be used to remove the duplicates if we have more than one duplicate in the table. Any thoughts.
Otherwise using Analytic functions is simple, straight and great.
Thanks for your wonderful service to the oracle community.

November 08, 2003 - 10:10 am UTC

sure it can.

you would partition by your "key", you would order by something, you would use row_number

select *
from ( select a.*,
row_number() over ( partition by THE_KEY order by ANYTHING ) rn
from t )
where rn = 1;

What if we have more than 1 duplicate?

Pichai Bala, November 10, 2003 - 1:20 pm UTC

Tom,
I would like to remove(delete) the duplicates using Analytic function if there is more than 1 duplicate for the KEY. I can't create a new table without the duplicates every time.

Thanks

November 10, 2003 - 3:05 pm UTC

delete from t
where rowid not in ( select rowid
from ( select a.*,
row_number() over ( partition by THE_KEY order by ANYTHING ) rn
from t )
where rn = 1);

would be one approach.

or

delete from t
where rowid in ( select rowid
from ( select a.*,
row_number() over ( partition by THE_KEY order by ANYTHING ) rn
from t )
where rn <> 1);

would perhaps be a better one.. depends on volumes of data.

Nice

Carl, February 18, 2004 - 10:59 am UTC

Dear sir,
Is there any alternate way to select duplicate rows from a
table other than the following
sql>select empno,ename from emp group by empno,ename
having count(*) > 1;
Do you have any other trick?
Please do reply.
Bye!

February 18, 2004 - 9:15 pm UTC

lots of them.

select *
from (select a.*, count(*) over (partition by <keys you want unique>) cnt
from some_table a
)
where cnt > 1;

is another way for example. it gets you the entire row -- not just the key.

A reader, April 26, 2004 - 10:10 pm UTC

Hi Tom,

I am trying to de-dupe using Max of DT field, however, if the value in
column y is blank or null,  then it should keep of y from DT lower than
the max of DT

SQL> select * from test1;


ID       Y          DT
---   ----     ----------
111   002      04/23/2004
111   001      04/24/2004
111               04/25/2004
222   005      04/25/2004
222   001      04/26/2004
222   003      04/24/2004


Result :

ID         Y        DT
---    ----  ----------
111    001   04/26/2004
222    001   04/26/2004


I would appreciate you help.

Thanks,

April 28, 2004 - 11:41 am UTC

I think you have a typo -- you meant 04/24/2004 in the id=111 row right....

Anyway, assuming you have this:

ops$tkyte@ORA9IR2> select * from t;
 
        ID Y   DT
---------- --- ----------
       111 002 04/23/2004
       111 001 04/24/2004
       111     04/25/2004
       222 005 04/25/2004
       222 001 04/26/2004
       222 003 04/24/2004
       333     04/25/2004
 
7 rows selected.
 
ops$tkyte@ORA9IR2>

<b>we can assign the row_number to the rows by ID by first putting all NON-NULL Y's first (decode does that) and then sorting by DT desc.  We'll keep all but the first row in each partition and delete them.  I added id=333 just to test the condition when ALL y's are null</b>


ops$tkyte@ORA9IR2> delete from t
  2   where rowid in
  3   (select rowid
  4      from (select row_number()
  5          over (partition by id order by decode(y,null,1,0), dt desc) rn
  6              from t )
  7     where rn <> 1 )
  8  /
 
4 rows deleted.
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> select * from t;
 
        ID Y   DT
---------- --- ----------
       111 001 04/24/2004
       222 001 04/26/2004
       333     04/25/2004

A reader, April 28, 2004 - 1:02 pm UTC

Hello Tom,

Thank you for your reply and I really apologize for not having problem written clearly.

The table has more than 3 columns and I am trying to keep the best values for each column with in a group depending on the latest date. I think I should call it as consolidation of the table. There may be cases when the DT would be null.

ops$tkyte@ORA9IR2> select * from t;

ID Y X DT
---------- --- -- ----------

111 002 L 04/23/2004
111 001 M 04/24/2004
111 A 04/25/2004
111 001 L null

222 005 L 04/25/2004
222 001 L 04/26/2004
222 003 M 04/24/2004

333 O 04/25/2004

444 A null
444 Z null

Excepted Result:

ID Y X DT
---------- --- -- ----------
111 001 A 04/25/2004
222 001 L 04/26/2004
333 O 04/25/2004
444 Z

Thanks a lot.

April 28, 2004 - 7:29 pm UTC

soooo.... 

why 444 z?  
why not 444 a?

you must just want to add another sort order.  rather than saying "this is what I input, this is what I want to see" please list in a specification format - these are the RULES for which record to keep.

then and only then can one generate "code" that works to "spec".


I can get your output:


ops$tkyte@ORA9IR2> select * from t;
 
        ID Y   X DT
---------- --- - ----------
       111 002 L 04/23/2004
       111 001 M 04/24/2004
       111     A 04/25/2004
       111 001 L
       222 005 L 04/25/2004
       222 001 L 04/26/2004
       222 003 M 04/24/2004
       333     O 04/25/2004
       444     Z
       444     A
 
10 rows selected.
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2>  delete from t
  2   where rowid in
  3   (select rowid
  4      from (select row_number()
  5          over (partition by id order by decode(y,null,1,0), dt desc) rn
  6              from t )
  7     where rn <> 1 )
  8  /
 
6 rows deleted.
 
ops$tkyte@ORA9IR2> select * from t;
 
        ID Y   X DT
---------- --- - ----------
       111 001 L
       222 001 L 04/26/2004
       333     O 04/25/2004
       444     Z


<b>(i inserted Z first :)

but I can also get your "non desired output"</b>:


ops$tkyte@ORA9IR2> select * from t;
 
        ID Y   X DT
---------- --- - ----------
       111 002 L 04/23/2004
       111 001 M 04/24/2004
       111     A 04/25/2004
       111 001 L
       222 005 L 04/25/2004
       222 001 L 04/26/2004
       222 003 M 04/24/2004
       333     O 04/25/2004
       444     A
       444     Z
 
10 rows selected.
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2>  delete from t
  2   where rowid in
  3   (select rowid
  4      from (select row_number()
  5          over (partition by id order by decode(y,null,1,0), dt desc) rn
  6              from t )
  7     where rn <> 1 )
  8  /
 
6 rows deleted.
 
ops$tkyte@ORA9IR2> select * from t;
 
        ID Y   X DT
---------- --- - ----------
       111 001 L
       222 001 L 04/26/2004
       333     O 04/25/2004
       444     A
 
ops$tkyte@ORA9IR2>

What if We have multiple columns to determine for DUPES ?

Arindam, May 18, 2004 - 12:34 pm UTC

Tom,
Thank you very much for providing us this wonderful site, which helps people like us in their day to day job. I have a question regarding the same.

I have a customer table like :

create table Customer
( rid number(6)primary key,
Fname varchar2(30),
Lname varchar2(30),
Addr1 varchar2(50),
Addr2 varchar2(50),
City varchar2(50),
State Varchar2(20),
Zip number(5),
create_dt date,
update_dt date,
pri_phone number(11),
sec_phone number(11));

My customer can come from diffrent "Sources" but this table MUST have UNIQUE ADDRESS for each Customer. For example.

Source 1
J. John
Hoffman Estates, IL 60195

Source 2
J. John
Hoffman Estates, IL 60195

Now Source 1 and Source 2 will have different RID's (which is NOT Customer ID's) but they are the same "PERSON". I have to "Identify" them as same person and keep only ONE address and not duplicate it. Now the above table will have around 100million data.

I am thinking of introducing Unique constraint on Fname,
Lname,Addr1 , Addr2, City , State , Zip . The method you have shown to find the dups on first place , could that be used on my situation (performance-wise)? Is introduction of UNIQUE KEY have significant impact on Data Loads ? If there is an alternative could you please advise me with the same ?

Thank You

May 18, 2004 - 6:26 pm UTC

seems you have a data model problem more than anything. seems customer addresses should be factored out.

For Arindam

J, May 18, 2004 - 7:05 pm UTC

FYI: There are third-party solutions out there that are specifically developed for address standardization. Suggest you Google on "address standardization", and look for something that you can call from within Oracle. Ciao!
- J

Factored Out ?

Arindam, May 21, 2004 - 12:47 pm UTC

Tom,
Are you suggesting to denormalising the table ? Could you please tell me how would that help us in eliminating duplicate addresses ? Or am I making a wrong deduction from what you suggested ? Please help.

Thanks
Arindam

May 21, 2004 - 2:00 pm UTC

quite the opposite. I'm suggesting you want to NORMALIZE the addresses out of the table. you have it quite denormalized now.

Design...

Arindam, May 21, 2004 - 2:49 pm UTC

Sorry,
I meant Normalized... I wasn't looking what I was writing but I had the Normalize thing in my mind when I said that. Anyways, so you suggest something in the line of...

create table Customer
( rid number(6)primary key,
cust_id number,
Fname varchar2(30),
Lname varchar2(30),
create_dt date,
update_dt date,
pri_phone number(11),
sec_phone number(11));

create table address
( cust_id number (6),
Addr1 varchar2(50),
Addr2 varchar2(50),
City varchar2(50),
State Varchar2(20),
Zip number(5))

Do you suggest that after this normalization I should use the UNIQUE constrainsts ? Please do remember that a same customer (with same address) may come from "different sources" (eg. Best buy,Circuit City) and may not have same cust_id. The Customer Table is Kind of MASTER CUSTOMER data.

If I do something like above how about the load processing when we have a 100 million data to be inserted and may be 30 million data updated / inserted every year , where the data will be updated/inserted every month , wouldn't it be slowing down the update/insert ?

Thanks you for you prompt answer. Please help .

Thanks
Arindam.

May 22, 2004 - 11:01 am UTC

I would not call "customer" "customer" -- it isn't (in fact -- ADDRESS looks like customer to me :)

but yes, factor out the data.

if a customer has different cust_id's they ain't going the be the same customer are they -- not unless YOU build some sort of mapping table. Seems to be more left unsaid then said here.

Multiple tables

Warrick, May 27, 2004 - 5:36 am UTC

Thanks for the guidance Tom.

I have around 800 tables that have had duplicates inserted into them - far too many to work through each individually.

I understand your responses to the other queries, but how can I 'round up' all of the tables and clean out the dups?

May 27, 2004 - 9:28 am UTC

define "dups"

is "dups" defined as "entire row = some other row in same table"

or is dups defined as "some set of columns for each table -- different in each case"

delete duplicate rows

bijay, May 27, 2004 - 10:46 pm UTC

drop table t1;
create table t1 ( c1 int, c2 int, c3 char(1) );
insert into t1 values ( 1, 50, 'a' );
insert into t1 values ( 1, 50, 'a' );
insert into t1 values ( 1, 50, 'a' );
insert into t1 values ( 1, 50, 'b' );
insert into t1 values ( 2, 89, 'x' );
insert into t1 values ( 2, 89, 'y' );
insert into t1 values ( 2, 89, 'x' );
insert into t1 values ( 2, 89, 'x' );
insert into t1 values ( 2, 89, 'z' );
insert into t1 values ( 2, 89, 'z' );
insert into t1 values ( 2, 89, 'z' );
commit;
select * from t1;

delete from T1
where rowid <> ( select max(rowid)
from t1 b
where b.c1 = t1.c1
and b.c2 = t1.c2 )
/
select * from t1;

duplicate rows

Ravi, June 04, 2004 - 5:36 am UTC

Example given :-
----------------

drop table t1;
create table t1 ( c1 int, c2 int, c3 char(1) );
insert into t1 values ( 1, 50, 'a' );
insert into t1 values ( 1, 50, 'a' );
insert into t1 values ( 1, 50, 'a' );
insert into t1 values ( 1, 50, 'b' );
insert into t1 values ( 2, 89, 'x' );
insert into t1 values ( 2, 89, 'y' );
insert into t1 values ( 2, 89, 'x' );
insert into t1 values ( 2, 89, 'x' );
insert into t1 values ( 2, 89, 'z' );
insert into t1 values ( 2, 89, 'z' );
insert into t1 values ( 2, 89, 'z' );
commit;
select * from t1;


        C1         C2 C3
---------- ---------- --
         1         50 a
         1         50 a
         1         50 a
         1         50 b
         2         89 x
         2         89 y
         2         89 x
         2         89 x
         2         89 z
         2         89 z
         2         89 z

11 rows selected.

SQL1
-----------
delete from T1
    where rowid <> ( select max(rowid)
                     from t1 b
                     where b.c1 = t1.c1
                       and b.c2 = t1.c2 )
/
9 rows deleted.

SQL> select * from t1;

        C1         C2 C
---------- ---------- -
         1         50 b
         2         89 z


/********************my sql*************************************/


drop table t1;
create table t1 ( c1 int, c2 int, c3 char(1) );
insert into t1 values ( 1, 50, 'a' );
insert into t1 values ( 1, 50, 'a' );
insert into t1 values ( 1, 50, 'a' );
insert into t1 values ( 1, 50, 'b' );
insert into t1 values ( 2, 89, 'x' );
insert into t1 values ( 2, 89, 'y' );
insert into t1 values ( 2, 89, 'x' );
insert into t1 values ( 2, 89, 'x' );
insert into t1 values ( 2, 89, 'z' );
insert into t1 values ( 2, 89, 'z' );
insert into t1 values ( 2, 89, 'z' );
commit;
select * from t1;
        C1         C2 C3
---------- ---------- --
         1         50 a
         1         50 a
         1         50 a
         1         50 b
         2         89 x
         2         89 y
         2         89 x
         2         89 x
         2         89 z
         2         89 z
         2         89 z

11 rows selected.

SQL2
-----------
DELETE FROM t1 a
WHERE EXISTS(SELECT b.c1,b.c2,b.c3,count(*)
             FROM   t1 b
             WHERE  b.c1=a.c1
             AND    b.c2=a.c2
             AND    b.c3=a.c3
             GROUP BY b.c1,b.c2,b.c3
             HAVING count(*) >1);


9 rows deleted.

SQL> commit;

Commit complete.

SQL> select * from t1;

        C1         C2 C
---------- ---------- -
         1         50 b
         2         89 y

Now you look carefully , the sql2 gives different result(which i feel is correct).
In the above sql I used niether rownum nor rowid.
But I used all columns in table which I feel is ok.

Can you advise wheather this is ok?(ofcourse the sql2 takes little more memory and time)

June 04, 2004 - 8:25 am UTC

only you can answer this.

which is the correct way for YOU?

they are two totally different things.  One de-dups by c1,c2 -- leaving at least one of the dups behind.

Yours says "if c1,c2,c3 are duplicated, REMOVE ALL copies"
Mines says "if c1,c2 are duplicated, remote all BUT ONE copy"

consider:

ops$tkyte@ORA10G> create table t1 ( c1 int, c2 int, c3 char(1) );
 
Table created.
 
ops$tkyte@ORA10G>
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> insert into t1 values ( 1, 1, 'a' );
 
1 row created.
 
ops$tkyte@ORA10G> insert into t1 values ( 1, 1, 'a' );
 
1 row created.
 
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> commit;
 
Commit complete.
 
ops$tkyte@ORA10G>
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> delete from T1
  2      where rowid <> ( select max(rowid)
  3                       from t1 b
  4                       where b.c1 = t1.c1
  5                         and b.c2 = t1.c2 )
  6  /
 
1 row deleted.
 
<b>one row gone, one row left behind</b>

ops$tkyte@ORA10G> rollback;
 
Rollback complete.
 
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> DELETE FROM t1 a
  2  WHERE EXISTS(SELECT b.c1,b.c2,b.c3,count(*)
  3               FROM   t1 b
  4               WHERE  b.c1=a.c1
  5               AND    b.c2=a.c2
  6               AND    b.c3=a.c3
  7               GROUP BY b.c1,b.c2,b.c3
  8               HAVING count(*) >1);
 
2 rows deleted.
 
<b>both rows gone -- this is not what most people mean by de-duping</b>


apples and flying toaster ovens they are....


and there is nothing but nothing wrong with using rowid or anything else.

delete few from million

Prasad, August 25, 2004 - 1:32 pm UTC

good afternoon Tom,

I've to delete a very few say (10-15) records from a 3-4 million table
is this the BEST way to do it?

delete y
where rowid <> ( select max(rowid)
from y ins
where ins.HASH_VAL = y.hash_val)

I've a index on hash_val & its almost unique.

Or can you suggest a better way to achieve this?

August 25, 2004 - 1:40 pm UTC

delete from t
where rowid in ( select rid
from ( select rowid rid,
row_number() over
( partition by x order by rowid ) rn
from t
)
where rn > 1 )
/

using an index on that would be "not a good thing" (tm)

you want a big juicy full scan to find the 10-15 rowids you want to then go back and delete.

Brilliant !!

Prasad, August 25, 2004 - 1:48 pm UTC

Tom,

Worked perfectly. Deleted in 2.05 mins.

Many Thanks,

keep thge latest record as Active and deactivate all the old dupes

Sujit, August 23, 2005 - 11:05 am UTC

I have about 100 million records. I found lot of duplicates based on some criteria. Luckily, I have a row_status_ind (Active/Inactive) that we use to maintain history in the fact table. I could use this row_status_ind and deactivate all the old dupes. I would like to keep only the lastest record (of the dupe set of rows) in Active state. How would I do this.

The following statement does exactly the opposite - keeps the first row of the dupe set as A (active) and deactivates all the others after it with a I.

update t
set row_status_ind ='I'
where rowid in ( select rid
from ( select rowid rid,
row_number() over
( partition by product_code order by rowid ) rn
from f_region_product
)
where rn > 1 )

Please help. Thanks

August 24, 2005 - 4:28 am UTC

you cannot order by rowid and assume any sort of "time".

do you have a timestamp on this record to see which is "latest"

select rid
from ( select rowid rid, row_number() over (partition by code order by TIMESTAMP DESC ) rn
from t )
where rn > 1

will generate the set of "older" records.

keep thge latest record as Active and deactivate all the old dupes

Sujit, August 24, 2005 - 12:15 pm UTC

I do have a timestamp. Your solution worked out great.

You are the best. Thanks

AD, September 07, 2005 - 7:11 pm UTC

Tom,

First the create table scripts

CREATE TABLE TEMP
(
  ACCOUNT_NO  NUMBER(16)                        NOT NULL,
  COL2        VARCHAR2(4),
  COL3        NUMBER(3),
  COL4        VARCHAR2(4)
)

SQL> insert into temp values (100, 'x', 100, 'y'');

1 row created.

SQL> insert into temp values (100, 'x',60, 'x');

1 row created.

SQL> insert into temp values (200, 'x', 50, 'y');

1 row created.

SQL>  insert into temp values (200,  'x', null, 'y');

1 row created.

select * from temp;

ACCOUNT_NO COL2                                           COL3 COL4
---------- ---------------------------------------- ---------- 
----------
       100 x                                               100 Y
       100 x                                                60 x
       200 x                                                50 y
       200 x                                                   y
       





I want to delete duplicate records based on the following:

if col3 is not null then keep the row where col2=col4
if col3 is null then keep any of the records where col3 is null

e.g from the above data, I expect to get an o/p :


ACCOUNT_NO COL2                                           COL3 COL4
---------- ---------------------------------------- ---------- 
----------
     
       100 x                                                60 x
       200 x                                                   y
  

I would very much appreciate your help.

Thanks

September 08, 2005 - 8:16 am UTC

why that row for account_no = 200, why not the other one??

You have a case where col3 is NOT NULL but no row with col2=col4 - so we have an incomplete specificiation here?

AD, September 08, 2005 - 2:51 pm UTC

Tom,

Just to clarify the above requirement:

within each set of duplicate records(based on account_no)
if col3 is not null for all duplicate records then keep the row where col2=col4
if col3 is null for any of the duplicate records then keep any record where col3 is null

e.g. in the above case
account_no = 100
in this case all duplicate rows have col3 populated(not null) so pick the row where col2=col4 and the result is

ACCOUNT_NO COL2 COL3 COL4
---------- ---------------------------------------- ----------
----------

100 x 60 x

again for accoun_no =200
one of the duplicate row has col3 null so pick the row where col3 is null

ACCOUNT_NO COL2 COL3 COL4
---------- ---------------------------------------- ----------
----------
200 x y

September 08, 2005 - 5:49 pm UTC

still missing what happens if col3 is not null for all - and there are no col2=col4 records.....


but here is the concept:

ops$tkyte@ORA10G> select *
  2    from (
  3  select x.*,
  4         row_number() over (partition by account_no
  5                            order by
  6                            case when cnt_col3 = cnt and col2 = col4 then 1 else null end,
  7                            case when cnt_col3<> cnt and col3 is null  then 1 else null end) rn
  8    from (
  9  select temp.*,
 10         count(col3) over (partition by account_no) cnt_col3,
 11         count(*)    over (partition by account_no) cnt
 12    from temp
 13         ) x
 14         )
 15   where rn = 1
 16  /
 
ACCOUNT_NO COL2       COL3 COL4   CNT_COL3        CNT         RN
---------- ---- ---------- ---- ---------- ---------- ----------
       100 x            60 x             2          2          1
       200 x               y             1          2          1
 
ops$tkyte@ORA10G>
ops$tkyte@ORA10G> delete from temp
  2   where rowid in
  3  (
  4  select rid
  5    from (
  6  select x.*,
  7         row_number() over (partition by account_no
  8                            order by
  9                            case when cnt_col3 = cnt and col2 = col4 then 1 else null end,
 10                            case when cnt_col3<> cnt and col3 is null  then 1 else null end) rn
 11    from (
 12  select temp.*, rowid rid,
 13         count(col3) over (partition by account_no) cnt_col3,
 14         count(*)    over (partition by account_no) cnt
 15    from temp
 16         ) x
 17         )
 18   where rn <> 1
 19  )
 20  /
 
2 rows deleted.
 
ops$tkyte@ORA10G> select * from temp;
 
ACCOUNT_NO COL2       COL3 COL4
---------- ---- ---------- ----
       100 x            60 x
       200 x               y
 
ops$tkyte@ORA10G>

Excellent

AD, September 08, 2005 - 6:24 pm UTC

Tom,

Thanks very much,
I agree the spec still missing some info. But the solution meets what I wanted to get out of it.

Regards,

Duplicates

Priya, October 24, 2006 - 2:19 pm UTC

Hi Tom,
Here is the situation. I have to create an unique index on a combination of 4 columns scenario,month,year and currency. But since I have duplicates on the table, I did this
select distinct scenario,month,year,currency,count(*) as dups from tf_fx_rates_tbl
group by scenario,month,year,currency having count(*) > 1
As you said, I created a temp table to hold these duplicate records.
Now in the main table, there are additional columns rates and entry date. If I delete the duplicates from the main table, the data for these 2 columns are gone. How to delete duplicates without missing any other column data?
As always, your valuable suggestions are appreciated.

Thanks,
Priya.

October 24, 2006 - 2:33 pm UTC

umm, if you delete a row, why would you expect some columns to remain? I'm not at all sure what you mean.

Are you trying to delete all of the records EXCEPT for the most current one or something???

I found another way on your site

Priya, October 24, 2006 - 2:50 pm UTC

Never mind Tom, I found another solution to this on your site. That solved my problem.
delete from Tf_FX_RATES_TBL T1
where rowid <> ( select max(rowid)
from TF_FX_RATES_TBL b
where b.SCENARIO = t1.scenario
and b.year = t1.year
and b.month = t1.month
and b.CURRENCY = t1.currency )

Thanks as always

Deleting duplicate rows considering following scenarios

Sunil, February 27, 2007 - 4:47 am UTC

Hi Tom,

How can we delete duplicate rows from table considering following scenarios? A duplicate row is where another row exists with exactly the same values for col1, col2 and col3.
The table has over 100,000 rows.

How would our approach differ for the following scenarios?
a) Database can be offline to the users for a short time
b) Database must be kept online with time critical user retrieval queries needing to access the table during the update.

Regards,
Sunil.

February 27, 2007 - 10:45 am UTC

I would not envision taking the database "offline" at all.

A little over 100,000 rows, this will be so fast - no one will notice anything.

ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> create table t ( col1 int, col2 int, col3 int, col4 int );

Table created.

Elapsed: 00:00:00.02
ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> insert into t
  2  select id, object_id, rownum, data_object_id
  3    from big_table.big_table
  4   where rownum <= 100000;

100000 rows created.

Elapsed: 00:00:00.28
ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> insert into t
  2  select id, object_id, rownum, data_object_id
  3    from big_table.big_table
  4   where rownum <= 50000;

50000 rows created.

Elapsed: 00:00:00.13
ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> commit;

Commit complete.

Elapsed: 00:00:00.03
ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> delete from t
  2   where rowid in ( select rid
  3                      from (select rowid rid, row_number() over (partition by col1, col2, col3 order by rowid) rn
  4                              from t)
  5                    where rn <> 1 )
  6  /

50000 rows deleted.

Elapsed: 00:00:01.46
ops$tkyte%ORA10GR2>

Speed up deletes using indexes

Craig Richards, February 27, 2007 - 12:53 pm UTC

I was told by Oracle support that when deleting large amounts of data, it is best to index the search column ie

delete from test where id_no = 23;

Create an index on id_no and it should speed the delete up, I have tested this and it seems to take the same time whether I use the index or not

Can you please shed some light on this?

Cheers

February 27, 2007 - 2:12 pm UTC

sigh

the only answer is:

IT DEPENDS

if test is small, index won't make a big deal one way or another.

if test is large and the number of rows such that "id_no=23" is very small, then the index could be very beneficial. It'll make the full scan of test not be a full scan - true it'll take longer to delete each row (index to be maintained), but since the full scan is so much faster, it goes faster.

if test is large, but there are many rows such that "id_no=23", then the index could make this take much much much longer.

IT DEPENDS

Removing Duplicates Without Using ROWID or ROWNUM

Graham Lloyd, March 06, 2007 - 12:44 pm UTC

Would this construct not answer the original question? It performs well and removes all duplicates without using ROWID or ROWNUM.

CREATE TABLE acc (account_id NUMBER, account_name VARCHAR2(30));
INSERT INTO acc VALUES(1,'one');
INSERT INTO acc VALUES(2,'two');
INSERT INTO acc VALUES(3,'two');
INSERT INTO acc VALUES(4,'two');
INSERT INTO acc VALUES(5,'three');
INSERT INTO acc VALUES(6,'three');
INSERT INTO acc VALUES(7,'four');
COMMIT;

SQL> SELECT * FROM acc;

ACCOUNT_ID ACCOUNT_NAME
---------- ------------------------------
1 one
2 two
3 two
4 two
5 three
6 three
7 four

7 rows selected.

SQL>
SQL> DELETE FROM acc ac1
2 WHERE ac1.account_id IN (SELECT ac2.account_id
3 FROM acc ac2
4 MINUS
5 SELECT MIN(ac3.account_id)
6 FROM acc ac3
7 GROUP BY ac3.account_name
8 );

3 rows deleted.

SQL>
SQL> SELECT * FROM acc;

ACCOUNT_ID ACCOUNT_NAME
---------- ------------------------------
1 one
2 two
5 three
7 four

4 rows selected.

March 06, 2007 - 1:09 pm UTC

there were no entirely duplicate records.

you presume the existence of some unique value - that presumption cannot be made.

Yes, if you have a unique key (what rowid and rownum are "psuedo" providing), you can remove "duplicates" if you presume the unique key is not part of what is to be considered duplicate.

Query running from last 28 hours still rows not delted.

Bhavesh, March 30, 2007 - 3:51 am UTC

Hi,
Tom base on your row_number query i try to remove duplicate rows.

SELECT COUNT(*)
FROM RPT_TMP_GPRS_MAR_DATA_2007
COUNT(*)
----------
8385015
1 row selected

sample structure:

SELECT *
FROM RPT_TMP_GPRS_MAR_DATA_2007
WHERE ROWNUM<11
TAC CUSTOMER_ID CO_ID ROAM FAX DATA GPRS CREDT_CD
------ ----------- ---------- ---- --- ---- ---- ----------
010115 4875587 5136216 HR
010115 4875587 5136216 HR
010115 4875587 5136216 HR
010115 4875587 5136216 HR
010115 4875587 5136216 HR
010115 4875587 5136216 HR
010115 4875587 5136216 HR
010115 4875587 5136216 HR
010115 4875587 5136216 HR
010115 4875587 5136216 HR
10 rows selected

SELECT index_name
FROM user_indexes
WHERE table_name LIKE 'RPT_TMP_GPRS_MAR_DATA_2007';

INDEX_NAME
------------------------------
GPRS_FEB_2007_COID
GPRS_FEB_2007_CUSTID

2 rows selected

so index is on coid.

my query :

DELETE FROM RPT_TMP_GPRS_MAR_DATA_2007
WHERE ROWID NOT IN ( SELECT ROWID
FROM ( SELECT a.*,
row_number()over(PARTITION BY co_id ORDER BY ROWID) rn
FROM RPT_TMP_GPRS_MAR_DATA_2007 a )
WHERE rn = 1);

i want to remove duplicates on co_id basis. and its running since last 27 hours. still not completed. what to do? please help me out. i have to do this exercise on monthly basis.

March 30, 2007 - 1:31 pm UTC

a) try the in
b) evaluate how fast the subquery itself executes - how many rows are you actually going to delete here.

are you the only thing working on this table - are you getting blocked?

what are you big wait events.

why do you have to do this every month, why not fix the problem at the SOURCE????

Thanks for the question regarding "Deleting duplicate records without using rowid and rownum

suneel, May 17, 2007 - 7:35 am UTC

hi tom,
my table data contains

empno ename
102 t
101 s
101 s
102 t

i want to delete duplicate records in a particular table with out using rowid and rownum. my desired output is

empno ename
101 s
102 t

May 17, 2007 - 11:22 am UTC

good luck with that - since you must not be using Oracle, I'll ask you to ask the vendor of your database for help.

creating unique key without deleting the existing duplicate record

Nidhin, October 19, 2007 - 8:30 am UTC

We have a table that have nearly 100K records of which some are duplicate. We dont want to delete these duplicate records. We want to create a constarint which will prevent the insertion of any duplicate records. We are using Oracle 8i database. Please help

t replaced by an inline view (having join of 3 tables)

A Reader, October 29, 2008 - 2:03 pm UTC

Hi Tom,

Needs one clarification! How about table "t" replaced by an inline view (joining 2-3 tables).

Will this query sill hold good. I did some experiments and the Inline-view uses the ROWID for one of the tables.

Delete from t
where rowid in ( select rid
from ( select rowid rid,
row_number() over
( partition by x order by rowid ) rn
from t
)
where rn > 1 )

October 29, 2008 - 4:09 pm UTC

If you mean T all of the way down in the inline view already - sure it could be, but I would code:

Delete from t
  where rowid in ( select rid
                     from ( select rid,
                                   row_number() over 
                                          ( partition by x order by rid ) rn
                              from (select t.rowid RID, t.x
                                      from t, t1, t2
                                     where ..... )
                           )
                    where rn > 1 )

to remove any ambiguity.

As an example

A Reader, October 29, 2008 - 10:29 pm UTC

Delete
from
(
SELECT
a.Empno AS Emp_Id,
a.Ename AS Emp_Name,
b.Dname AS DEpt_Name
FROM
emp a, dept b
WHERE
a.deptno = b.deptno
) t
where rowid in ( select rid
from ( select rowid rid,
row_number() over
( partition by Emp_Name, DEpt_Name order by rowid ) rn
from (
SELECT
a.Empno AS Emp_Id,
a.Ename AS Emp_Name,
b.Dname AS DEpt_Name
FROM
emp a, dept b
WHERE
a.deptno = b.deptno
) t
)
where rn > 1 )
/

I was mentioning t to be an in-line view of EMP and DEPT. What I observed was ROWID of the INline-view (t) is taken as EMP table's ROWID and it is working.

Why is that "t" 's Rowid becomes EMP'ID. How does Oracle do it.

Thanks

October 30, 2008 - 8:05 am UTC

I would be EXPLICIT - select the rowid (like I said above!) just select out the ROWID you want.

Why rely on something you cannot see? Be explicit.

(we do it by identifying the key preserved table, we know your primary and foreign keys - if we didn't we couldn't)

ops$tkyte%ORA10GR2> create table emp as select * from scott.emp;

Table created.

ops$tkyte%ORA10GR2> create table dept as select * from scott.dept;

Table created.

ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> select rowid rid
  2  from (
  3  select empno, ename, dname from emp, dept where emp.deptno = dept.deptno
  4  );
select empno, ename, dname from emp, dept where emp.deptno = dept.deptno
                                *
ERROR at line 3:
ORA-01445: cannot select ROWID from, or sample, a join view without a key-preserved table


ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> alter table emp add constraint emp_pk primary key(empno);

Table altered.

ops$tkyte%ORA10GR2> alter table dept add constraint dept_pk primary key(deptno);

Table altered.

ops$tkyte%ORA10GR2> alter table emp add constraint emp_fk_dept foreign key(deptno) references dept(deptno);

Table altered.

ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> select rowid rid
  2  from (
  3  select empno, ename, dname from emp, dept where emp.deptno = dept.deptno
  4  );

RID
------------------
AAAoQqAAEAAABq8AAA
AAAoQqAAEAAABq8AAB
AAAoQqAAEAAABq8AAC
AAAoQqAAEAAABq8AAD
AAAoQqAAEAAABq8AAE
AAAoQqAAEAAABq8AAF
AAAoQqAAEAAABq8AAG
AAAoQqAAEAAABq8AAH
AAAoQqAAEAAABq8AAI
AAAoQqAAEAAABq8AAJ
AAAoQqAAEAAABq8AAK
AAAoQqAAEAAABq8AAL
AAAoQqAAEAAABq8AAM
AAAoQqAAEAAABq8AAN

14 rows selected.

but if you change one thing about your approach here and into the future and never rely on defaults, never rely on side effect, always be explicit - even if you means you have to type 5,000 more lines of code - or just the word ROWID - you will never be sorry - never.

You'll only be sorry if you do not :)

deleting duplicate record

Biswaranjan, May 02, 2010 - 8:48 am UTC

HI tom,

Super duper thanks to your support.

I am having a table with data.

table a

1 2
1 2
4 5
3 4
3 4
5 6
7 5
5 6

so how can I delte the second row of the duplicate data[frequency of duplicate data is 2] without using rowid or rownum.

regards,
Ranjan

May 06, 2010 - 11:23 am UTC

why? I don't do homework, if you want to do this, the correct answers have been provided.

You cannot possibly have a requirement that says "you cannot use X", rowid it somewhat necessary if you want to do this in SQL since you need to identify a single row in a table and for that you either need a primary key (which you obviously do not have) or a surrogate key (which we have graciously provided you via rowid)

A reader, May 14, 2010 - 12:38 am UTC

How can we delete records in a table using update table?

can rank() be used

vrna, July 20, 2011 - 5:44 pm UTC

Can we use rank() over (partition..
instead of row_number() here?

July 22, 2011 - 1:06 pm UTC

you can

but it would give you different answers.

what is the question you are trying to answer here? That would determine which of row_number, rank and dense_rank would apply.

rank = row_number if and only if the order by is ordering by something UNIQUE. If there are duplicates in your order by key, then rank and row_number return different results.

delete duplicates Without using rowid

Susil Kumar Nagarajan, December 14, 2011 - 9:56 am UTC

Apply mathematical concept here....
AUA = A, A intersect A = A

create table tmp_table1
as
Select * from table1
union
Select * from table1;

truncate table table1;

insert into table1 select * from tmp_table1;

drop table tmp_table1;

How we used to do it in COBOL

Andrew Reid, October 17, 2012 - 5:14 pm UTC

The URL below shows the standard way of deleting duplicates.
Then I have followed this by showing how we used to do it in COBOL but I have "translated" it into PL/SQL:

http://international-dba.blogspot.co.uk/2010/12/how-to-delete-duplicate-rows.html

Rao, April 14, 2014 - 12:28 pm UTC

Hi,
I am trying to query non-duplicate values from a table but not all fields are duplicate.
For example, if there are 3 duplicate values then i need to display only one(any field)
My table got data as below...
col1 col2 col3
---- ---------- -----
ibm thinkvision city2
ibm notepad city1
dell inspiron city1
dell alien city2
hp lenovo city3

and i need output as this..

col1 col2 col3
---- ---------- -----
ibm thinkpad city2
dell inspiron city1
hp lenovo city3

or this..

col1 col2 col3
---- ---------- -----
ibm notepad city1
dell alien city2
hp lenovo city3

Can you help me with a solution..

April 16, 2014 - 4:18 pm UTC

select col1, max(col2), max(col3) from t group by col1;

you want distinct col1's (apparently) and some data from col2, col3, ....

just group by col1 and apply any aggregate you want to col2, col3......

if you need the values in col2, col3... to all come from the "same" row (you don't say - your sample output seems to maybe 'imply' this, but you don't say) then

select * 
  from (select t.*, row_number() over (partition by col1 order by rowid) rn
          from t )
 where rn = 1

if there are syntax errors in my examples, that is your fault. no test case with creates/inserts - no way I'm going to actually try to run anything.

Live SQL

Share and learn SQL and PL/SQL; free access to the latest version of Oracle Database!

Dev Gym

Classes, workouts and quizzes on Oracle Database technologies. Expertise through exercise!

Deleting duplicate records without using rowid and rownum

Breadcrumb

Question and Answer

You Asked

and Tom said...

Rating

Comments

ummmm

Why wouldn't you use ROWID

I agree with Trevor

Not only academic question, resources plays role as well

Sorry for being confusing, .....

Another reason for de-duping without rowid

What is we have more than 1 duplicate?

What if we have more than 1 duplicate?

Nice

What if We have multiple columns to determine for DUPES ?

For Arindam

Factored Out ?

Design...

Multiple tables

delete duplicate rows

duplicate rows

delete few from million

Brilliant !!

keep thge latest record as Active and deactivate all the old dupes

keep thge latest record as Active and deactivate all the old dupes

Excellent

Duplicates

I found another way on your site

Deleting duplicate rows considering following scenarios

Speed up deletes using indexes

Removing Duplicates Without Using ROWID or ROWNUM

Query running from last 28 hours still rows not delted.

Thanks for the question regarding "Deleting duplicate records without using rowid and rownum

creating unique key without deleting the existing duplicate record

t replaced by an inline view (having join of 3 tables)

As an example

deleting duplicate record

can rank() be used

delete duplicates Without using rowid

How we used to do it in COBOL

More to Explore

Live SQL

Dev Gym