Skip to Main Content
  • Questions
  • Eliminating duplicates conditionally

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, Ravi.

Asked: December 23, 2011 - 1:25 pm UTC

Last updated: January 02, 2013 - 8:16 am UTC

Version: 10g

Viewed 1000+ times

You Asked

Hi Tom,

We have a table where we need to pick one of the duplicate rows.

Following is the test case:

CREATE TABLE TEST_DUP (ID number,
device varchar2(10),
caption varchar2(10),
inst_date date);

INSERT INTO TEST_DUP VALUES(10,'DEVICE1','CAPTION1','12-DEC-2011');
INSERT INTO TEST_DUP VALUES(10,'DEVICE2','CAPTION2','12-DEC-2011');
INSERT INTO TEST_DUP VALUES(11,'DEVICE3','CAPTION3','13-DEC-2011');
INSERT INTO TEST_DUP VALUES(11,NULL,NULL,'13-DEC-2011');
INSERT INTO TEST_DUP VALUES(12,NULL,NULL,'13-DEC-2011');
INSERT INTO TEST_DUP VALUES(12,NULL,NULL,'13-DEC-2011');

Following are the rules for selecting one of the 2 duplicate rows given by our business.

if duplicates are found:

1) pick the row with device is not null and caption is not null
and inst_date is not null
2) if device,caption and inst_date NOT NULL in both the rows,
then pick anyone of them
3) if both the rows have NULL values in device,caption and inst_date
then pick any one of them

So, the preference is with device,caption and inst_date NOT NULL or else pick any one of them.

This table would be part of a join with other tables. Could we do this in a single SQL?

Thanks!

and Tom said...

your list is very confusing. There are three points, but I think only two.

1) if device/caption/inst_date is not null - keep any row that satisfies that

2) if you don't fine a row with all are not null - keep any single row.


You actually missed a rule for your data - what happens when device and caption are null but inst_date isn't? You were really specific with the columns - but didn't cover all of the cases?

Assuming you just want the row with 3 non-nulls if it exists, and any single row if it doesn't:


ops$tkyte%ORA11GR2> select *
  2    from (
  3  select id, device, caption, inst_date,
  4         row_number() over (partition by id order by device, caption, inst_date) rn
  5    from test_dup
  6         )
  7   where rn = 1
  8  /

        ID DEVICE     CAPTION    INST_DATE         RN
---------- ---------- ---------- --------- ----------
        10 DEVICE1    CAPTION1   12-DEC-11          1
        11 DEVICE3    CAPTION3   13-DEC-11          1
        12                       13-DEC-11          1


if you wanted to dedup the table itself, the delete would be:

ops$tkyte%ORA11GR2> delete from test_dup
  2  where rowid in
  3  (
  4  select rid
  5    from (
  6  select rowid rid,
  7         row_number() over (partition by id order by device, caption, inst_date) rn
  8    from test_dup
  9         )
 10   where rn <> 1
 11  )
 12  /

3 rows deleted.

ops$tkyte%ORA11GR2> select * from test_dup;

        ID DEVICE     CAPTION    INST_DATE
---------- ---------- ---------- ---------
        10 DEVICE1    CAPTION1   12-DEC-11
        11 DEVICE3    CAPTION3   13-DEC-11
        12                       13-DEC-11



Rating

  (7 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Ravi B, December 23, 2011 - 2:55 pm UTC

Sorry my bad. I dint intend to put that inst_date there.
I got this point where i could select the rows with not null values, but i couldn't get the NULL part.

What if i have couple of more rows like

INSERT INTO TEST_DUP VALUES(15,NULL,NULL,NULL);
INSERT INTO TEST_DUP VALUES(15,NULL,NULL,NULL);
Tom Kyte
December 23, 2011 - 4:50 pm UTC

the order by would put any row with null,null,null dead last - so you would get what you want.

is there a problem with the query provided? doesn't it answer the question?

if not, restate the question correctly from the beginning.

Follow up question

Ravi B, June 27, 2012 - 7:18 pm UTC

Hi Tom,

Considering original test case but adding one more row for id=10.

How do we select a row with latest inst_date for id=10. With current query it picks DEVICE1, CAPTION1. I am looking to pick with inst_date = 14-DEC-2011.

Order by inst_date (asc/desc) doesn't seems to be working

CREATE TABLE TEST_DUP (ID number,
device varchar2(10),
caption varchar2(10),
inst_date date);

INSERT INTO TEST_DUP VALUES(10,'DEVICE1','CAPTION1','12-DEC-2011');
INSERT INTO TEST_DUP VALUES(10,'DEVICE2','CAPTION2','12-DEC-2011');
INSERT INTO TEST_DUP VALUES(10,'DEVICE2','CAPTION2','14-DEC-2011');
INSERT INTO TEST_DUP VALUES(11,'DEVICE3','CAPTION3','13-DEC-2011');
INSERT INTO TEST_DUP VALUES(11,NULL,NULL,'13-DEC-2011');
INSERT INTO TEST_DUP VALUES(12,NULL,NULL,'13-DEC-2011');
INSERT INTO TEST_DUP VALUES(12,NULL,NULL,'13-DEC-2011');

select *
from (
select id, device, caption, inst_date,
row_number() over (partition by id order by device, caption, inst_date) rn
from test_dup
) a
where rn = 1
Tom Kyte
June 28, 2012 - 9:30 am UTC

well, you are only partitioning by id.

within an id, 14-dec-2011 is way down the list:



ops$tkyte%ORA11GR2> select * from test_dup order by device, caption, inst_date DESC;

        ID DEVICE     CAPTION    INST_DATE
---------- ---------- ---------- ---------
        10 DEVICE1    CAPTION1   12-DEC-11
        10 DEVICE1    CAPTION1   12-DEC-11
        10 DEVICE2    CAPTION2   14-DEC-11
        10 DEVICE2    CAPTION2   14-DEC-11
        10 DEVICE2    CAPTION2   12-DEC-11
        10 DEVICE2    CAPTION2   12-DEC-11



so I don't know what you mean - caption1 comes before caption2 - so that takes precedence.

RaviB, June 27, 2012 - 8:44 pm UTC

Tom,

this is what i wanted.
If device, caption,inst_date are all not null then pick the record with latest install date.

How do i accomplish this by modifying this query:

select *
from (
select id, device, caption, inst_date,
row_number() over (partition by id order by device, caption, inst_date) rn
from test_dup
) a
where rn = 1

Thanks,
Ravi
Tom Kyte
June 28, 2012 - 9:33 am UTC

seems you want to order by inst_date first when all three are not null then doesn't it.

not by device, caption, inst_date


something like this:






ops$tkyte%ORA11GR2> select *
  2        from (
  3      select id, device, caption, inst_date,
  4             row_number() over (partition by id order by case when device is not null and caption is not null and
  5                     inst_date is not null then 0 else 1 end,
  6                     inst_date desc, device, caption) rn
  7        from test_dup
  8            ) a
  9       where rn = 1
 10  /

        ID DEVICE     CAPTION    INST_DATE         RN
---------- ---------- ---------- --------- ----------
        10 DEVICE2    CAPTION2   14-DEC-11          1
        11 DEVICE3    CAPTION3   13-DEC-11          1
        12                       13-DEC-11          1



why would you order by device first if you wanted to pick off a maximum date?

followup..

Ravi B, June 28, 2012 - 2:22 pm UTC

Hi Tom,

This data comes to us from different clients. After talking to our business people here is what i got.

1) when device is not null and caption is not null
and inst_date is not null
pick the row with latest inst_date

expected result: for id =10 pick row with
inst_date=14-DEC-2011

2) if a row with latest date has NULL values
for device and caption pick a row which has
caption is not null
(caption always has preference than device)

expected result: for id = 11
pick the row with inst_date=16-DEC-2011

3) if a row where inst_date is null then pick
ANY row which has caption is not null
(caption always has preference than device)

expected result: for id = 12 caption = 'CAPTION1'

To sum it up the general rule is:

- Priority 1: Caption
- Priority 2: Device

Within each priority, latest date will determine the row that we pick.

Here is the data again.

CREATE TABLE TEST_DUP (ID number,
device varchar2(10),
caption varchar2(10),
inst_date date
);

INSERT INTO TEST_DUP VALUES(10,'DEVICE1','CAPTION1','12-DEC-2011');
INSERT INTO TEST_DUP VALUES(10,'DEVICE2','CAPTION2','12-DEC-2011');
INSERT INTO TEST_DUP VALUES(10,'DEVICE2','CAPTION2','14-DEC-2011');

INSERT INTO TEST_DUP VALUES(11,NULL,'CAPTION1','16-DEC-2011');
INSERT INTO TEST_DUP VALUES(11,'DEVICE2',NULL,'17-DEC-2011');
INSERT INTO TEST_DUP VALUES(11,NULL,NULL,'18-DEC-2011');

INSERT INTO TEST_DUP VALUES(12,NULL,'CAPTION1',NULL);
INSERT INTO TEST_DUP VALUES(12,'DEVICE2',NULL,NULL);
INSERT INTO TEST_DUP VALUES(12,NULL,NULL,'18-DEC-2011');
Tom Kyte
June 29, 2012 - 9:59 am UTC

so, use what I coded above and apply your rules to sorting. If you understand what I did above, creating your order by will be easy!! go for it and report back how you did it.

you want to put the rows where all three are not null first (like I did)

and put all rows where the date and caption is not null 2nd (i didn't do that, but you will!)

put any rows where caption is not null 3rd

and any remaining last


and within that sort by the date field - easy.

could you please verify?

Ravi B, June 29, 2012 - 11:41 am UTC

Tom,

This what i did. Results come out good but could you please let me know if there could be any issues?

select *
  from (
        select id, device, caption, inst_date,
               row_number() over (partition by id order by case when device is not null and caption is not null and inst_date is not null then 0 
                       when inst_date is not null and caption is not null then 1 
                       when caption is not null then 2
                       else 3 end,
                       inst_date desc, device, caption) rn
          from test_dup
              ) a
         where rn = 1;


Tom Kyte
July 02, 2012 - 6:45 am UTC

you just did exactly what I did above, but actually started sorting by what you meant to sort by.

So, you are using the technique I told you to use (twice).

Looks good to me.

eliminating rows

Ravi B, December 20, 2012 - 5:44 pm UTC

Hi Tom,

Could you please help me with this.

I have a following table.

drop table test;
create table test (
cat_sw_product_id number,
cat_sw_edition_id number,
cat_sw_version_id number);

insert into test values(1111,null,11);
insert into test values(1111,null,12);
insert into test values(1111,null,13);
insert into test values(2222,null,20);
insert into test values(2222,1,30);
insert into test values(2222,2,40);

commit;

Following are the rules to select the rows.

For a given cat_sw_product_id there could be multiple rows in a table.

1) for a given cat_sw_product_id if there is atleast one row or more rows with cat_sw_edition_id is NOT NULL then pick all the rows with cat_sw_edition_id is NOT NULL and ignore the rows with cat_sw_edition_id IS NULL

2) for a given cat_sw_product_id if all the rows has cat_sw_edition_id is NULL then pick all the rows

Thanks!
Tom Kyte
January 02, 2013 - 8:16 am UTC

ops$tkyte%ORA11GR2> select *
  2    from (
  3  select test.*,
  4         count(*) over (partition by cat_sw_product_id) cnt_all,
  5         count(cat_sw_edition_id) over (partition by cat_sw_product_id) cnt_not_null
  6    from test
  7         )
  8   where cnt_not_null = 0
  9      or (cnt_not_null > 0 and cat_sw_edition_id is not null)
 10  /

CAT_SW_PRODUCT_ID CAT_SW_EDITION_ID CAT_SW_VERSION_ID    CNT_ALL CNT_NOT_NULL
----------------- ----------------- ----------------- ---------- ------------
             1111                                  11          3            0
             1111                                  12          3            0
             1111                                  13          3            0
             2222                 1                30          3            2
             2222                 2                40          3            2

ops$tkyte%ORA11GR2> 

eliminating rows

Ravi B, December 20, 2012 - 5:48 pm UTC

As follow up to my earlier question:

for cat_sw_product_id = 1111 we should pick all the rows.
for cat_sw_product_id = 2222 we should pick all the rows with cat_sw_edition_id in (1,2)

More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.