
Question and Answer

Connor McDonald

Thanks for the question, gagan.

Asked: January 29, 2004 - 10:21 am UTC

Last updated: November 20, 2015 - 9:33 am UTC

Version: 9.0.2

Viewed 100K+ times!

You Asked

Please explain how to remove duplicate records from a large table containing about 5 million records in a single run, in less time.
I tried it with the following query, but it takes 10 hours:

delete from test1 where rowid not in (select min(rowid) from test1 group by rc_no);

Even after increasing the rollback segment tablespace to 7 GB, we are not getting the desired results; we generally run into this kind of problem when using a NOT IN clause and a cursor.

thanks

and Tom said...

I'd generate the set of rowids to delete using analytics and then delete them, like this:


ops$tkyte@ORA9IR2> create table t as select * from cust;

Table created.

Elapsed: 00:00:03.64
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> select count(*), count(distinct cust_seg_nbr) from t;

  COUNT(*) COUNT(DISTINCTCUST_SEG_NBR)
---------- ---------------------------
   1871652                      756667

Elapsed: 00:00:05.30
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> delete from t
  2   where rowid in ( select rid
  3                      from ( select rowid rid,
  4                                    row_number() over
  5                                      (partition by cust_seg_nbr order by rowid) rn
  6                               from t
  7                           )
  8                     where rn <> 1 )
  9  /

1114985 rows deleted.

Elapsed: 00:01:46.06
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> select count(*), count(distinct cust_seg_nbr) from t;

  COUNT(*) COUNT(DISTINCTCUST_SEG_NBR)
---------- ---------------------------
    756667                      756667

Elapsed: 00:00:02.48



As for the RBS -- it will get as big as it needs to be in order to process the delete. Every index will make it "larger" and take longer as well (index maintenance is expensive).

If you are deleting a lot of the rows, you might be better off disabling indexes, doing the delete, and rebuilding them, as sketched below.
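A minimal sketch of that approach, assuming a hypothetical non-unique index T_IDX on T (a unique index cannot be skipped this way, since it is needed to enforce uniqueness during DML):

alter index t_idx unusable;
alter session set skip_unusable_indexes = true;  -- the default in 10g and later

delete from t
 where rowid in ( select rid
                    from ( select rowid rid,
                                  row_number() over
                                    (partition by cust_seg_nbr order by rowid) rn
                             from t )
                   where rn <> 1 );

alter index t_idx rebuild nologging;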


OR, creating a new table that just keeps the "right records" and dropping the old table:


ops$tkyte@ORA9IR2> create table t as select * from cust;

Table created.

Elapsed: 00:00:02.41
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> select count(*), count(distinct cust_seg_nbr) from t;

  COUNT(*) COUNT(DISTINCTCUST_SEG_NBR)
---------- ---------------------------
   1871652                      756667

Elapsed: 00:00:04.60
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> create table t2
  2  as
  3  select cust_seg_nbr
  4    from ( select t.*, row_number() over (partition by cust_seg_nbr order by rowid) rn
  5             from t
  6         )
  7   where rn = 1
  8  /

Table created.

Elapsed: 00:00:10.93
ops$tkyte@ORA9IR2> drop table t;

Table dropped.

Elapsed: 00:00:00.56
ops$tkyte@ORA9IR2> rename t2 to t;

Table renamed.

Elapsed: 00:00:00.01
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> select count(*), count(distinct cust_seg_nbr) from t;

  COUNT(*) COUNT(DISTINCTCUST_SEG_NBR)
---------- ---------------------------
    756667                      756667

Elapsed: 00:00:01.18
ops$tkyte@ORA9IR2>




Comments

stats ...

Christo Kutrovsky, January 30, 2004 - 3:44 pm UTC

Very neat way of removing duplicates; however, there's one little detail.

If the table has statistics, the CBO chooses a FULL TABLE SCAN the second time too, as it always predicts a selectivity of 1 (i.e. it always thinks it will return all rows).

Thus, the analytics way will run with the same performance as the group-by one, IF THE TABLE IS ANALYZED.

The workaround is to use the RULE hint:

delete /*+RULE*/ from t
 where rowid in ( select rid
                    from ( select rowid rid,
                                  row_number() over
                                    (partition by cust_seg_nbr order by rowid) rn
                             from t
                         )
                   where rn <> 1 )




Tom Kyte
January 30, 2004 - 8:15 pm UTC

I surely did not experience that in my example...

Rule or CBO -- I got the same plans, actually.

Follow up

Christo Kutrovsky, February 02, 2004 - 11:24 am UTC

I did not see an ANALYZE in your example.

I see:
- create table t as ...
- select ... from t
- delete ... from t



Tom Kyte
February 02, 2004 - 12:26 pm UTC

I went back and analyzed before following up.

I used the CBO in my response to you. I did not run with tkprof the first time; I went back and did it over.

Clarification Please

Mark Macke, February 03, 2004 - 11:38 am UTC

Very interesting approach to deleting duplicate rows, which I would like to use myself. To apply it to my situation I would appreciate some clarification:

1. Should the PARTITION BY clause contain ALL of the columns used to define a row as a duplicate?

2. Can the ORDER BY clause be something other than rowid?

3. Why is this better than the older method:

delete from t
 where rowid <> ( select min(rowid)
                    from t t2
                   where t2.primary_key_column1 = t.primary_key_column1
                     and t2.primary_key_column2 = t.primary_key_column2
                     ....
                     and t2.primary_key_columnN = t.primary_key_columnN );


Tom Kyte
February 03, 2004 - 1:47 pm UTC

1) yes.

2) yes.

3) that is a correlated subquery. Think about what it will do -- row by row -- in many cases. This approach is more "en masse".
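For (2), a sketch of keeping the most recently changed row rather than an arbitrary one; the column names here (key_col, last_updated) are hypothetical:

delete from t
 where rowid in ( select rid
                    from ( select rowid rid,
                                  row_number() over
                                    (partition by key_col
                                         order by last_updated desc) rn
                             from t )
                   where rn <> 1 );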


removing duplicate rows

A reader, February 27, 2006 - 1:58 pm UTC

I have this query which is running very slowly, about 2 records per minute, and I have to delete some 12,000 records.

DELETE FROM comp
 WHERE rowid not in
       (SELECT MIN(rowid)
          FROM comp
         GROUP BY companyid, agentid, class, status, terminationdate);

Can you suggest a faster method?



Tom Kyte
February 27, 2006 - 2:43 pm UTC

likely:

delete from t
 where rowid IN ( select rid
                    from ( select rowid rid,
                                  row_number() over
                                    (partition by companyid, agentid, class,
                                                  status, terminationdate
                                         order by rowid) rn
                             from t )
                   where rn <> 1 );




A reader, February 27, 2006 - 2:47 pm UTC

Should I use IN or NOT IN?

Tom Kyte
February 27, 2006 - 3:11 pm UTC

if you use my technique, you are deleting WHERE IN: you are generating the set of rowids to REMOVE.

Using your technique (I'll guess: rule-based optimizer :( ) you are generating the set of rowids to keep, and doing it EACH TIME, with the RBO, for EACH ROW.

removing duplicate rows

A reader, March 01, 2006 - 2:01 pm UTC

Tom,
a) Regarding the RBS, can we write PL/SQL and commit every 1,000/10,000 rows?
b) Would you recommend doing this? If not, why?


Tom Kyte
March 02, 2006 - 8:24 am UTC

a) yes, but it is "hard"
b) not in general no. If you have access to any of my books, I write extensively on the "why"

to remove duplicate records

chinns, March 07, 2006 - 6:08 am UTC

Hi Tom,
This is chinns. For deleting duplicate rows, you gave the following query:

SQL> delete from emp
   where rowid in ( select rid from ( select rowid rid,row_number() over
                                      (partition by empno order by rowid) rn from emp)
                     where rn <> 1 );

16385 rows deleted.

Elapsed: 00:00:21.56
SQL>


It is taking more time to delete records compared with my query:

SQL> delete from emp a where rowid not in(select max(rowid) from emp b where a.empno=b.empno);

16385 rows deleted.

Elapsed: 00:00:17.68


Table
=====
CREATE TABLE EMP
(
  EMPNO     NUMBER(4)                           NOT NULL,
  ENAME     VARCHAR2(10),
  JOB       VARCHAR2(9),
  MGR       NUMBER(4),
  HIREDATE  DATE,
  SAL       NUMBER(7,2),
  COMM      NUMBER(7,2),
  DEPTNO    NUMBER(2)
)

SAME RECORD INSERTED 16386 TIMES.
=================================
insert into emp values(1,'kris','sysdba',1,'12-jan-2006','10000',10,20)


SQL> select count(*), count(distinct deptno) from emp;

  COUNT(*) COUNT(DISTINCTDEPTNO)
---------- ---------------------
     16386                     1

Elapsed: 00:00:00.08

Can you tell me why this is happening? I am an Oracle learner.
Thanks in advance.

Tom Kyte
March 08, 2006 - 4:27 pm UTC

I would scarcely say that based on wall-clock timings from SQL*Plus. At 22 and 18 seconds, they look pretty comparable to me.



comparing two tables

shashidhat B kallur, March 16, 2006 - 10:59 am UTC

Hi,
How do you compare two tables that have terabytes of records, where some of the columns have different data types and different column names? Please explain the query; it will be very useful for further processing.

Thanks a lot,
shashi

Tom Kyte
March 16, 2006 - 2:42 pm UTC

First, you answer this for me: how do you compare an apple to an orange?

If the column types are DIFFERENT, I can answer the question immediately: the data is different. I don't even need to look at it.


https://www.oracle.com/technetwork/issue-archive/2005/05-jan/o15asktom-084959.html
has a query to compare two tables; however, for terabytes it will take a bit of work (and temp), of course.

You will have to provide FUNCTIONS (to_char, whatever) that map the types to the same types.

And one hopes you know what columns to compare, regardless of their names!
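The query at that link is essentially a GROUP BY compare. A sketch, where a and b stand for the two tables, c1..c3 for the comparable columns, and the second branch applies whatever functions (to_char, etc.) are needed to map its columns to the first branch's types:

select c1, c2, c3,
       count(src1) cnt_in_a, count(src2) cnt_in_b
  from ( select a.c1, a.c2, a.c3, 1 src1, to_number(null) src2 from a
         union all
         select b.x1, b.x2, to_char(b.x3), to_number(null), 2 from b )
 group by c1, c2, c3
having count(src1) <> count(src2);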


slow update

jas, July 26, 2006 - 6:53 pm UTC

I have this query which is updating 22,000 records in 3 hours.
How can I use OVER (PARTITION BY ...) to speed this query up?




UPDATE t_c_a tca
   SET (tca.ag_id, tca.agy_date, tca.flag) =
       ( select cao.ag_id,
                x.max_date_in,
                'Y'
           from t_stg ohs,
                cl cao,
                ( SELECT a.c_num,
                         a.acc_id,
                         ca.c_id,
                         MAX(aoh.date_in) max_date_in,
                         aoh.date_out
                    FROM t_ aoh,
                         ACC a,
                         t_c_a ca
                   WHERE aoh.c_num = a.c_num
                     AND a.acc_id = ca.acc_id
                     AND aoh.date_out is null
                   GROUP BY a.c_num, a.acc_id, ca.c_id, aoh.date_out
                   ORDER BY a.c_num, ca.c_id
                ) x
          where ohs.c_num = x.c_num
            and rtrim(upper(cao.o_code)) = rtrim(upper(ohs.o_code))
            and ohs.date_in = x.max_date_in
            and tca.acc_id = x.acc_id
            and tca.c_id = x.c_id
       );


This is just what I was looking for

Suzanne, October 27, 2006 - 9:10 pm UTC

It took me a little bit to figure out row_number() over (partition by ...). This is a very elegant solution. Thank you.

I also enjoyed your presentation at OpenWorld.

removing duplicates from a string

ht, February 22, 2007 - 5:37 pm UTC

Tom,
How can I remove duplicates from a string? I would like to only display 'a@z.com' once:

select 'a@z.com,b@z.com,c@z.com,a@z.com' from dual;

'A@Z.COM,B@Z.COM,C@Z.COM,A@Z.CO
-------------------------------
a@z.com,b@z.com,c@z.com,a@z.com

1 row selected.

Thanks,
ht

almost there but is there a better way?

ht, February 23, 2007 - 4:13 pm UTC

Hi Tom,
There's probably a simpler way to accomplish this but, with the pl/sql below, is there a way to reconstruct the string with only distinct values?

Basically, I would like to transform:

'abc@c.com,x@c.com,yxx@c.com,x@c.com,z@c.com,'

into

'abc@c.com,x@c.com,yxx@c.com,z@c.com,' (no duplicate x@c.com entry)

Since I'm now able to display each entry independently, is there a more elegant solution than taking the output and inserting it into a table, selecting distinct out of it, and reconstructing the string?

declare
  2    l_string1 varchar2(2000);
  3    l_string2 varchar2(2000);
  4    l_count number:=0;
  5    l_number_of_occurrences number;
  6    --
  7
  8  begin
  9    l_string1:='abc@c.com,x@c.com,yxx@c.com,x@c.com,z@c.com,';
 10    -- number of occurrences
 11    select
 12      (length(l_string1) - nvl(length(replace(l_string1,',','')),0)) /
 13      length(',') cnt
 14    into
 15      l_number_of_occurrences
 16    from
 17      dual;
 18
 19    for i in 1 .. l_number_of_occurrences loop
 20      if l_count=0 then
 21        l_string2:=
 22          substr
 23          (
 24            l_string1,
 25            1,
 26            instr(l_string1,',',1,1)-1
 27          );
 28      else
 29        l_string2:=
 30          -- position,length
 31          substr
 32          (
 33            l_string1,
 34            -- occurrence,position
 35            instr(l_string1,',',1,l_count)+1,
 36            instr(l_string1,',',1,l_count+1)-instr(l_string1,',',1,l_count)-1
 37          );
 38      end if;
 39      l_count:=l_count+1;
 40      dbms_output.put_line(l_count||':'||l_string2);
 41    end loop;
 42
 43    -----------------------------------------------------
 44  end;
 45  /
1:abc@c.com
2:x@c.com
3:yxx@c.com
4:x@c.com
5:z@c.com


Tom Kyte
February 26, 2007 - 12:39 pm UTC

ops$tkyte%ORA10GR2> variable txt varchar2(4000)
ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> exec :txt := 'abc@c.com,x@c.com,yxx@c.com,x@c.com,z@c.com,'

PL/SQL procedure successfully completed.

ops$tkyte%ORA10GR2> column t format a15
ops$tkyte%ORA10GR2> column str format a50
ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> with
  2  data
  3  as
  4  (
  5  select token, rownum rn
  6    from (
  7  select distinct
  8    trim( substr (txt,
  9          instr (txt, ',', 1, level  ) + 1,
 10          instr (txt, ',', 1, level+1)
 11             - instr (txt, ',', 1, level) -1 ) )
 12      as token
 13     from (select ','||ltrim(rtrim(:txt,','),',')||',' txt
 14             from dual)
 15   connect by level <=
 16      length(txt)-length(replace(txt,',',''))-1
 17   )
 18  )
 19  select ltrim( max( sys_connect_by_path(token,',') ), ',' ) || ','  str
 20     from data
 21    start with rn = 1
 22  connect by prior rn = rn-1
 23  /

STR
--------------------------------------------------
abc@c.com,x@c.com,yxx@c.com,z@c.com,


see, sometimes SQL can be fun :)

Thanks

A reader, February 26, 2007 - 2:45 pm UTC

I knew you had the solution. I'm still trying to digest the code though ...
Tom Kyte
February 26, 2007 - 3:35 pm UTC

You read my mind again

ht, February 26, 2007 - 5:12 pm UTC

Hi Tom,
Before reading your blog, I created a function using your code. Is the view/context method you published a better performing solution?

>create or replace package test_pkg2
  2  as
  3    function parse_dupes
  4    (
  5      in_string in long
  6    )
  7    return varchar2;
  8  end test_pkg2;
  9  /
>show errors;
No errors.
>----------------------------
>create or replace package body test_pkg2
  2  as
  3    function parse_dupes
  4    (
  5      in_string in long
  6    )
  7    return varchar2
  8    is
  9      l_string long;
 10    begin
 11      --------------
 12      select
 13        ltrim
 14        (
 15          max
 16          (
 17            sys_connect_by_path
 18            (
 19              value,
 20              ','
 21            )
 22          ), ','
 23        ) || ',' str
 24      into l_string
 25      from
 26      ----------------------------------------------
 27      (
 28        select
 29          value,
 30          rownum rn
 31        from
 32        (
 33          -- list distinct values in a vertical column list
 34          select distinct trim
 35          (
 36            substr
 37            (
 38              in_string,
 39              instr (in_string, ',', 1, level ) + 1,
 40              instr (in_string, ',', 1, level+1) - instr (in_string, ',', 1, level) -1
 41            )
 42          ) as value
 43          from
 44          (
 45            select ','||ltrim(rtrim(in_string,','),',')||',' in_string
 46            from dual
 47          )
 48          connect by level <= length(in_string)-length(replace(in_string,',',''))-1
 49        )
 50      )
 51      ----------------------------------------------
 52      start with
 53        rn = 1
 54      connect by prior
 55        rn = rn-1;
 56
 57      return l_string;
 58    end parse_dupes;
 59  end test_pkg2;
 60  /
>show errors;
No errors.

Tom Kyte
February 26, 2007 - 5:21 pm UTC

I would go with pure SQL if I had pure SQL to go with.

And we do...

santosh

santosh, June 12, 2008 - 9:16 am UTC

How can we select/delete the 7th record from a table?
Tom Kyte
June 12, 2008 - 9:37 am UTC

Define what your concept of "7th record" is first.

Rows are not really stored in any particular order -- there is no concept of the "7th record". So, you'll need to define that concept for us.
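For instance, if "7th" meant 7th when ordered by some hypothetical created_date column, a sketch would be:

select *
  from ( select t.*, row_number() over (order by created_date) rn
           from t )
 where rn = 7;

A delete would use the same inner query to generate that row's rowid, as in the examples above.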

Does not work in Oracle 9.0.2

Rahul, April 21, 2009 - 1:35 pm UTC

SELECT id,
       trim( SUBSTR( txt, instr( txt, '#', 1, level ) + 1,
             instr( txt, '#', 1, level+1 ) - instr( txt, '#', 1, level ) - 1 ) ) AS token
  FROM ( SELECT REPLACE( SUBSTR( txt, 0, instr( txt, ',' )-1 ), '#', '' ) AS id,
                '#' || SUBSTR( txt, instr( txt, ',' )+1 ) AS txt
           FROM ( SELECT '#' ||'2,first#second#third#fourth' ||'#' txt
                    FROM dual ) )
CONNECT BY level <= LENGTH( '2,first#second#third#fourth' )
                    - LENGTH( REPLACE( '2,first#second#third#fourth','#', '' ) ) + 1

Works in Oracle Database 10g Enterprise Edition Release 10.2.0.2.0 - 64bit,
but it does not work in 9.0.2.
Is there a workaround in Oracle 9.0.2?
Tom Kyte
April 21, 2009 - 3:49 pm UTC

try

with data
as
(that_query)
select * from data;
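Applied to the query above, that suggestion looks like this (the inner query is unchanged):

with data as (
  SELECT id,
         trim( SUBSTR( txt, instr( txt, '#', 1, level ) + 1,
               instr( txt, '#', 1, level+1 ) - instr( txt, '#', 1, level ) - 1 ) ) AS token
    FROM ( SELECT REPLACE( SUBSTR( txt, 0, instr( txt, ',' )-1 ), '#', '' ) AS id,
                  '#' || SUBSTR( txt, instr( txt, ',' )+1 ) AS txt
             FROM ( SELECT '#' ||'2,first#second#third#fourth' ||'#' txt
                      FROM dual ) )
  CONNECT BY level <= LENGTH( '2,first#second#third#fourth' )
                      - LENGTH( REPLACE( '2,first#second#third#fourth','#', '' ) ) + 1
)
select * from data;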


to delete duplicate data

Demetrios Kakavas, April 29, 2009 - 5:35 pm UTC

The source data that is loaded into our tables apparently does not have an "upper" constraint, nor does the table in our database. We have duplicate records, but the issue is...

example:

JOHN T SMITH
John T Smith
john t smith

I have put together the following SQL:

select val_univ_id, upper(val_nam_first), upper(val_nam_last)
  from pers a
 where rowid > (select min(rowid)
                  from pers b
                 where upper(b.val_nam_first) = upper(a.val_nam_first)
                   and upper(b.val_nam_last) = upper(a.val_nam_last))
 order by upper(val_nam_last);

However, my results are not pulling all the duplicate scenarios.

Any ideas?

Thanks.

Demetrios

Tom Kyte
April 29, 2009 - 9:53 pm UTC

well, if you give me

a) create table
b) inserts into said table
c) query to return data that is not "duplicated"

we can start from there.



ops$tkyte%ORA10GR2> create table t ( u_id number, fname varchar2(10), lname varchar2(10) );

Table created.

ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> insert into t values ( 1, 'JOHN', 'SMITH' );

1 row created.

ops$tkyte%ORA10GR2> insert into t values ( 2, 'john', 'SMITH' );

1 row created.

ops$tkyte%ORA10GR2> insert into t values ( 3, 'john', 'smith' );

1 row created.

ops$tkyte%ORA10GR2> insert into t values ( 4, 'fred', 'smed' );

1 row created.

ops$tkyte%ORA10GR2> insert into t values ( 5, 'fred', 'smed' );

1 row created.

ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> select *
  2    from (select u_id, fname, lname, row_number() over (partition by upper(fname), upper(lname) order by 'whatever' ) rn
  3            from t)
  4   where rn != 1
  5  /

      U_ID FNAME      LNAME              RN
---------- ---------- ---------- ----------
         5 fred       smed                2
         1 JOHN       SMITH               2
         3 john       smith               3

Removing Duplicates that arise from self-join

Mukund Nampally, February 09, 2010 - 11:38 pm UTC

Hi Tom,
I did a self-join on a table to find the duplicate pairs, but when I do a self-join the same pair is retrieved twice. To eliminate this I used rowid. The results are satisfactory so far, but I need your comments on this, and on whether there is another way to eliminate the repeats.
Example:
create table clients(
  clnt_client_id varchar2(10) primary key,
  clnt_fname     varchar2(10),
  clnt_lname     varchar2(10),
  clnt_ssn       varchar2(9),
  clnt_dob       date)
inserts:
insert into clients values ('123','john','davis','123983471','7-jul-1980');
insert into clients values ('321','john','davis','123983471','7-jul-1980');

insert into clients values ('890','michael','arthur','123983473','7-jun-1980');
insert into clients values ('891','michael','arthur','123983473','7-jun-1980');

insert into clients values ('567','susan','robert','123983471','7-jul-1980');
insert into clients values ('999','susan','robert','123983471','7-jul-1980');
insert into clients values ('870','susan','robert','123983471','7-jul-1980');

query:
select *
  from clients c1, clients c2
 where c1.clnt_client_id <> c2.clnt_client_id
   and c1.clnt_fname = c2.clnt_fname
   and c1.clnt_lname = c2.clnt_lname
   and c1.clnt_ssn = c2.clnt_ssn
   and c1.clnt_dob = c2.clnt_dob

The result is that client 123 joins to client 321, and client 321 joins to client 123.

But these are really the same pair, and I wanted to eliminate the repeats, so I resorted to c1.rowid > c2.rowid; now I am able to get each pair only once.

I want to know if there is a way to tell which row of each pair Oracle will select, and also whether this is the best way to do this.

Thank you

Tom Kyte
February 15, 2010 - 3:07 pm UTC

the rowid technique works, you could use your own primary key as well - clnt_client_id - in exactly the same manner.


You know which "pair" is going to be selected in all cases, the one with the greater rowid - it doesn't "mean" anything, but that is what you are getting.

kevin, August 31, 2010 - 5:41 am UTC

What about this query?

create table new_table as (select * from old_table union select * from old_table);

It will give only distinct rows...

Thanks

Tom Kyte
September 09, 2010 - 11:38 am UTC

why not just

select distinct * from old_table;



your approach would "work" but would be very very inefficient
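If the goal really was a new, deduplicated table, the same idea works as a single CTAS:

create table new_table as select distinct * from old_table;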

Duplicate set

A reader, June 16, 2011 - 6:13 am UTC

CREATE TABLE TEST_DUPLICATE
(
  L_NAME  VARCHAR2(20 BYTE),
  F_NAME  VARCHAR2(20 BYTE),
  ADDRESS VARCHAR2(20 BYTE),
  PHONE   VARCHAR2(20 BYTE),
  ID      NUMBER(3)
);

SET DEFINE OFF;
Insert into TEST_DUPLICATE (L_NAME, F_NAME, ADDRESS, PHONE, ID) Values ('A', 'B', 'ADD1', '1234', 1);
Insert into TEST_DUPLICATE (L_NAME, F_NAME, ADDRESS, PHONE, ID) Values ('A', 'B', 'ADD2', '1234', 2);
Insert into TEST_DUPLICATE (L_NAME, F_NAME, ADDRESS, PHONE, ID) Values ('A', 'C', 'ADD5', '456', 3);
Insert into TEST_DUPLICATE (L_NAME, F_NAME, ADDRESS, PHONE, ID) Values ('A', 'B', 'ADD1', '1234', 4);
Insert into TEST_DUPLICATE (L_NAME, F_NAME, ADDRESS, PHONE, ID) Values ('A', 'B', 'ADD1', '1235', 5);
Insert into TEST_DUPLICATE (L_NAME, F_NAME, ADDRESS, PHONE, ID) Values ('A', 'C', 'ADD3', '457', 6);
COMMIT;

I need to find duplicate records based on l_name and f_name; in this case records (1,2,4,5) belong to one set and (3,6) belong to a second set. In the test_distinct table, populate the distinct records and assign a new id:
CREATE TABLE SCOTT.TEST_DISTINCT
(
  NEW_ID NUMBER(3),
  L_NAME VARCHAR2(20 BYTE),
  F_NAME VARCHAR2(20 BYTE)
);
SET DEFINE OFF;
Insert into TEST_DISTINCT (NEW_ID, L_NAME, F_NAME) Values (50, 'A', 'B');
Insert into TEST_DISTINCT (NEW_ID, L_NAME, F_NAME) Values (51, 'A', 'C');
COMMIT;

In the following table, I want to create a new id per set and associate all previous ids with the new id:
CREATE TABLE SCOTT.TEST_DUPLICATE_RECORD
(
  NEW_ID      NUMBER(3),
  PREVIOUS_ID NUMBER(3)
);

SET DEFINE OFF;
Insert into TEST_DUPLICATE_RECORD (NEW_ID, PREVIOUS_ID) Values (50, 1);
Insert into TEST_DUPLICATE_RECORD (NEW_ID, PREVIOUS_ID) Values (50, 2);
Insert into TEST_DUPLICATE_RECORD (NEW_ID, PREVIOUS_ID) Values (50, 4);
Insert into TEST_DUPLICATE_RECORD (NEW_ID, PREVIOUS_ID) Values (50, 5);
Insert into TEST_DUPLICATE_RECORD (NEW_ID, PREVIOUS_ID) Values (51, 6);
Insert into TEST_DUPLICATE_RECORD (NEW_ID, PREVIOUS_ID) Values (51, 3);
COMMIT;

Tom Kyte
June 17, 2011 - 1:35 pm UTC

here is a start:

ops$tkyte%ORA11GR2> select l_name, f_name, address, phone, id, dense_rank() over (order by l_name, f_name) grp from test_duplicate;

L_ F_ ADDRESS    PHONE              ID        GRP
-- -- ---------- ---------- ---------- ----------
A  B  ADD1       1235                5          1
A  B  ADD2       1234                2          1
A  B  ADD1       1234                1          1
A  B  ADD1       1234                4          1
A  C  ADD3       457                 6          2
A  C  ADD5       456                 3          2

6 rows selected.




I have no idea what you mean by the second two things. So you want those tables with that data - great. Not sure what I'm supposed to do with it.


duplicate set

A reader, June 19, 2011 - 3:15 am UTC

Thanks. I need your help to populate these two tables.
Tom Kyte
June 20, 2011 - 9:59 am UTC

well, if you look at my query above, select out just the columns you need, and DISTINCT it, you get the first table you wanted (except I don't know how I was supposed to get 50 and 51 as ids -- I like 1 and 2 better -- so since you didn't tell me the logic behind 50 and 51, you can do that part)

and then if you just select the last two columns - you get your second table.


Why would you need my help for that - it is pretty clear, you have the query that generates 100% of everything you need, you just need to select from it.
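A sketch of the two populating statements, using the dense_rank itself as the new id (the logic behind the 50/51 numbering was not specified, so that part is an assumption):

insert into test_distinct (new_id, l_name, f_name)
select distinct grp, l_name, f_name
  from ( select l_name, f_name,
                dense_rank() over (order by l_name, f_name) grp
           from test_duplicate );

insert into test_duplicate_record (new_id, previous_id)
select dense_rank() over (order by l_name, f_name), id
  from test_duplicate;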

Duplicate Issue

Saurabh, August 18, 2011 - 3:01 am UTC

Hi Tom,

Following is the data in the table:

Col1  Col2  Col3
A     B     C
B     A     C

I want only one of the rows to come in the o/p. Please suggest something in simple SQL.

Thanks and Regards
Saurabh


Tom Kyte
August 23, 2011 - 2:29 am UTC

o/p?


No clue what o/p is or means.

No idea what you are asking...

Duplicate Issue

Saurabh, August 18, 2011 - 4:50 am UTC

Hi Tom,

Further to my question above, as a workaround I came up with the following solution:

select A.*
  from tab1 A left outer join tab2 B
    on (A.col1||A.col2||A.col3) = (B.col2||B.col1||B.col3)
 where A.rowid > B.rowid;

For me the query is giving correct results, but I am not sure about the reliability of this statement.
Please enlighten me with your ideas!

Thanks and Regards:
Saurabh
Tom Kyte
August 23, 2011 - 2:54 am UTC

well, what if col1 = 'A' and col2 = 'BB' and col3 = NULL in one table and col1 = 'AB' and col2 = null and col3 = 'B' in the other table....


also, the performance of that will be a tad "bad" doing all of that concatenation -- can't you just use "equals"?

To Saurabh

Michel Cadot, August 23, 2011 - 6:58 am UTC


I think you will profit from spending some time with the GREATEST and LEAST functions.

Regards
Michel
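Putting that hint together with the row_number() pattern from the original answer, a sketch that treats (col1, col2) as an unordered pair (Tom's NULL caveat above still applies if the columns are nullable):

select *
  from ( select t.*,
                row_number() over
                  (partition by least(col1, col2), greatest(col1, col2), col3
                       order by rowid) rn
           from tab1 t )
 where rn = 1;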

In many cases, better to use the WITH clause

Fabio Poroli, April 06, 2012 - 8:37 am UTC

I fought with the same issue for quite some time. But then I found my preferred solution for removing duplicate data (it works best when the amount of data to remove is limited):

DELETE FROM CLIENTS A
 WHERE A.ROWID IN
 (
   WITH CLIENTSWITHMOREID AS
   (
     SELECT CLIENTID, MIN(ROWID) AS MIN_ROWID
       FROM CLIENTS C
      GROUP BY CLIENTID
     HAVING COUNT(CLIENTID) > 1
   )
   SELECT A.ROWID "ROWID"
     FROM CLIENTS A, CLIENTSWITHMOREID B
    WHERE A.CLIENTID = B.CLIENTID
      AND A.ROWID > B.MIN_ROWID
 )

This performs quite well and is very easy to understand.

Query behaves differently when data volume is over 100 million rows

A reader, January 09, 2013 - 10:09 am UTC

Hi Tom,

In one of our databases,
 
SQL>select * from v$version;

Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE    11.2.0.3.0      Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production


a table created by a third party has 208 columns, several of them NOT NULL; 6 indexes in total (transid and another column are in a composite index); no primary key, no unique key.

A third-party data load process (with no possibility of seeing the source code) sometimes fails, and we then need to delete all duplicated rows.

The first delete query ran about 5 hours to delete about 12% of the rows (e.g. 15,928,476 of 130,165,183 rows):

Delete from trans.fact_tab A WHERE a.rowid > ( select min(rowid) from trans.fact_tab B where A.transid = B.transid );



The second delete query ran about 9.5 hours on the same data:

delete from trans.fact_tab 
      where rowid in ( select rid
                         from ( select rowid rid,
                                       row_number() over
                                         (partition by transid order by rowid) rn
                                  from trans.fact_tab 
                              )
                       where rn <> 1 );


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 3282996973

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name                  | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------
|   0 | DELETE STATEMENT                |                       |     1 |    54 |       |   925K  (1)| 04:37:35 |
|   1 |  DELETE                         | FACT_TAB              |       |       |       |            |          |
|   2 |   NESTED LOOPS                  |                       |     1 |    54 |       |   925K  (1)| 04:37:35 |
|   3 |    VIEW                         | VW_NSO_1              |   116M|  1337M|       |   712K  (1)| 03:33:51 |
|   4 |     SORT UNIQUE                 |                       |     1 |  2787M|       |            |          |
|*  5 |      VIEW                       |                       |   116M|  2787M|       |   712K  (1)| 03:33:51 |
|   6 |       WINDOW SORT               |                       |   116M|  4013M|  4910M|   712K  (1)| 03:33:51 |
|   7 |        TABLE ACCESS STORAGE FULL| FACT_TAB              |   116M|  4013M|       |   273K  (1)| 01:21:59 |
|   8 |    TABLE ACCESS BY USER ROWID   | FACT_TAB              |     1 |    42 |       |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - filter("RN"<>1)

20 rows selected.



I tried the second delete query on a table with a few million rows some time ago, and it was faster than the first one. It seems that in this case, when the data reaches the 100-million level, it is slower than the other one (I am the only user in the database). I am not sure what causes this; could you please throw some light on it?

Thanks

Tom Kyte
January 14, 2013 - 12:53 pm UTC

stop using DELETE for such a large set of rows - delete is way too expensive for millions of rows like this.

use create table as select to select out the rows you want to keep.

drop the old table

rename the new table

recreate the indexes


all of the above can be done using parallel, skipping all undo generation, bypassing redo generation if you want.

I would suspect you'll be able to do in minutes (seriously) what is currently taking hours.



Creating tables/indexes using parallel, skipping undo generation, bypassing redo generation

Shimmy, January 15, 2013 - 8:54 am UTC

Is this how we recreate a table using parallel, skipping all undo generation and bypassing redo generation?

ALTER SESSION FORCE PARALLEL DML;

CREATE TABLE SK_TBL_TEMP NOLOGGING AS
SELECT /*+PARALLEL(ao 16)*/ owner, object_id, object_name
from all_objects ao;

CREATE INDEX SK_TBL_TEMP_I1 ON SK_TBL_TEMP(object_id) NOLOGGING;



Thank you
Tom Kyte
January 15, 2013 - 2:24 pm UTC

you are doing parallel DDL, not DML; that alter session isn't appropriate


http://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel003.htm#i1006520
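A corrected sketch of the script above, using parallel DDL instead (the degree of 16 is just an assumption):

alter session enable parallel ddl;

create table sk_tbl_temp parallel 16 nologging
as
select owner, object_id, object_name
  from all_objects;

create index sk_tbl_temp_i1 on sk_tbl_temp(object_id) parallel 16 nologging;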



CTAS is great

A reader, January 22, 2013 - 10:35 am UTC

I totally agree with you that CTAS is the way to go. I did provide a CTAS solution (which took less than 20 minutes) to the team in charge of the data load process in the first place, but they stuck with the vendor's solution; they are scared to death to change anything. You wouldn't believe people like that live on the planet.

Select Distinct 'vs' Not in Duplicate sql

AbuMahmoud, February 25, 2013 - 1:45 am UTC

Dear Tom,
I want to select distinct data from my table in two columns.
So, I have two approaches as follows:

1) Select Action_Month, Action_Year
     From Hr_Holiday
    Where Rowid Not In ( Select Rid
                           From ( Select Rowid Rid,
                                         Row_Number() Over
                                           (Partition By Action_Month, Action_Year Order By Rowid) Rn
                                    From Hr_Holiday )
                          Where Rn <> 1 );

2) Select Distinct Action_Month, Action_Year From Hr_Holiday;

a. Which one is best, and why?

b. When I need to select * from the table without duplication in one or more columns, I can do the following:

sql> Select * From Hr_Holiday
      Where Rowid Not In ( Select Rid
                             From ( Select Rowid Rid,
                                           Row_Number() Over
                                             (Partition By Action_Month, Action_Year Order By Rowid) Rn
                                      From Hr_Holiday )
                            Where Rn <> 1 );

How can I do that (select * from the table) while using DISTINCT on only one or more columns?

Tom Kyte
February 25, 2013 - 11:32 am UTC

a) select distinct probably is; I would not have considered analytics at all.

It is intuitive, it says what you want to do, and it will use a single pass on the table.

The analytic approach in this case is totally obscure; it would take someone a long time to figure out its intent, and it would make multiple passes on the data.





b)
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:122801500346829407

read my original answer and the first followup for two approaches


Removing duplicates from huge table

Prasad Mynumpati, July 02, 2015 - 6:06 am UTC

How about creating another table with a unique key based on the duplicate criteria? Export from the source and import into the newly created table; during the import, duplicates should be eliminated, right?

This is Fab

Sandeep, November 19, 2015 - 10:11 pm UTC

Hi

This is an excellent approach.
Now, my scale of data is huge: not millions, it's a 30-billion-row table, and I need to find the unique records.

I have tested with 50 million records (a single-column table) and the query took less than 4 minutes; going at this rate my table will take 40 hours ;(. Please suggest a viable approach.

Many Thanks,
Sandeep

Connor McDonald
November 20, 2015 - 3:05 am UTC

So you allowed the table to get to 30 billion rows in size...and THEN you started thinking about the validity of the data :-)

Are we talking LOTS of duplicates? Because if you've got millions/billions of *duplicate* rows, then you're in for a bad time. What follows are some ideas for when the number of duplicates is small.

The ideal solution: do a one-off parallel operation with sufficient TEMP and PGA allocated to minimize the costs. If you *can't* do that, then you need to come up with alternative strategies. No matter what you do, the process is conceptually:

a - read the data
b - sort/hash the data to effectively "order" it
c - scan the ordered data for duplicates

It's *most* likely that the performance pain comes in "b" above, so you need to find ways to mitigate that.

For example, you might have a column that (say) has 10 distinct values, where you know that a duplicate would not SPAN these values. Logically you can then split the task into 10 chunks of work, each being the same SQL but with "where col = value1", "where col = value2", and so forth as predicates. That might reduce the size of the operation down to something where the sorting fits in RAM rather than spilling to disk, as sketched below.
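A sketch of that chunked approach, with hypothetical column names (region_id splits the data, surrogate_key defines a duplicate; run once per region_id value):

delete from t
 where rowid in ( select rid
                    from ( select rowid rid,
                                  row_number() over
                                    (partition by surrogate_key order by rowid) rn
                             from t
                            where region_id = 1 )  -- then 2, 3, ... 10
                   where rn <> 1 );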

Alternatively, if you have an existing index that matches the column definition for uniqueness, you could walk along that index comparing current and previous rows. It would run for a LONG LONG time (because it's walking index entries one by one) but you won't need any large sorting space.

This is Fab

Sandeep, November 20, 2015 - 6:29 am UTC

Thanks for the response! I feel the logical split of the query is the way to go.

Nah, I did not allow it to grow to 30 billion rows and only then start worrying about validity. It's legacy, and it hurts :) We are migrating the history data in a data warehousing environment from System X to System Y, and the column in question is a surrogate key, hence the many occurrences.

Best,
Sandeep

Connor McDonald
November 20, 2015 - 9:33 am UTC

I hope the cleaning up of the data goes well.
