Skip to Main Content
  • Questions
  • SQL to identify duplicates of data groups.

Breadcrumb

Question and Answer

Connor McDonald

Thanks for the question, Anand.

Asked: November 06, 2017 - 2:47 pm UTC

Last updated: September 24, 2019 - 11:01 am UTC

Version: 11G

Viewed 10K+ times! This question is

You Asked

Hi,
I have a peculiar requirement to identify duplicates of data groups. So here is a sample data set.

        PI         BU PR                   AC
---------- ---------- -------------------- ----------
      1001        100 PR1                  AC1
      1001        100 PR1                  AC2
      1002        100 PR1                  AC1
      1003        100 PR1                  AC1
      1003        100 PR1                  AC2
      1004        100 PR1                  AC2
      1005        100 PR1                  AC1
      1005        100 PR1                  AC2


Of these groups of PI 1001, 1003 and 1005 are considered duplicates since each of those PIs have the same BU, PR, AC combinations. Is there an easy way to identify the groups as duplicates.

Here's the test case setup scripts.

CREATE TABLE T
(
  PI     NUMBER,
  BU     NUMBER,
  PR   VARCHAR2(20 BYTE),
  AC  VARCHAR2(10 BYTE)
);

Insert into T Values (1001, 100, 'PR1', 'AC1');
Insert into T Values (1001, 100, 'PR1', 'AC2');
Insert into T Values (1002, 100, 'PR1', 'AC1');
Insert into T Values (1003, 100, 'PR1', 'AC1');
Insert into T Values (1003, 100, 'PR1', 'AC2');
Insert into T Values (1004, 100, 'PR1', 'AC2');
Insert into T Values (1005, 100, 'PR1', 'AC1');
Insert into T Values (1005, 100, 'PR1', 'AC2');

COMMIT;

SELECT * FROM T;


Thanks in Advance for you help.


and Chris said...

They all seem to have different values in the AC column? Why are they duplicates?

In any case, to find duplicates all you need to do is:

- Group by the expressions which identify duplicate rows
- Return those having a count > 1

For example:

SELECT pi, bu, pr, substr(ac, 1, 2) , count(*)
FROM   T
group  by pi, bu, pr, substr(ac, 1, 2)
having count(*) > 1;

PI     BU    PR    SUBSTR(AC,1,2)   COUNT(*)   
  1001   100 PR1   AC                        2 
  1003   100 PR1   AC                        2 
  1005   100 PR1   AC                        2 


You can read more techniques for finding (and removing!) duplicates at:

https://blogs.oracle.com/sql/how-to-find-and-delete-duplicate-rows-with-sql

Rating

  (8 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

To better explain

Anand Ramanathan, November 06, 2017 - 5:49 pm UTC

The data could be like this.
PI BU PR AC
---------- ---------- -------------------- ----------
1001 100 PR1 AC1
1001 100 AB1 CD2
1002 100 PR1 AC1
1003 100 PR1 AC1
1003 100 AB1 CD2
1004 100 PR1 AC2
1005 100 PR1 AC1
1005 100 AB1 CD2

so for PI 1001 it processed, group 100,PR1,AC1 and group 100, AB1, CD2. The exact same groups that were processed by 1003 and 1005. Then these become duplicates. Is this clearer?

Thanks.
Chris Saxon
November 07, 2017 - 11:57 am UTC

So what exactly are you expecting to see in the output?

In any case, the principle is the same: group by the columns that mark a duplicate and return those with a count > 1.

One possible solution

Dan Blum, November 06, 2017 - 10:30 pm UTC

I believe this will work but may not perform well on large volumes. (However I am not sure what WOULD do this well on large volumes.)

create type number_t as table of number;

with grp as (select g.*, rownum grp_num from (select distinct bu, pr, ac from t) g),
pi_grp as (select t.pi, cast(collect(distinct grp.grp_num order by grp.grp_num) as number_t) grps
             from t inner join grp 
               on t.bu = grp.bu
              and t.pr = grp.pr
              and t.ac = grp.ac
            group by t.pi)
select * from pi_grp p1 inner join pi_grp p2
on p1.pi < p2.pi
and p1.grps = p2.grps;

Chris Saxon
November 07, 2017 - 12:00 pm UTC

It seems a little... complicated. You should be able to do this with a single pass through the table (i.e. no joins). Either using group by or analytics (depending on what exactly is required in the output).

Using analytics

Dan Blum, November 07, 2017 - 3:12 pm UTC

It can definitely be made simpler but I'm not sure you can get away without a join. This will produce the required information to compare PI values:

select pi, cast(collect(distinct group_num order by group_num) as number_t) group_nums from
(select pi, rank() over (order by bu, pr, ac) group_num 
from t) 
group by pi;


The output will have all the PI values together with the set of distinct combinations of the other columns. Assuming I understand the question properly, the OP wants to find all pairs of PI values that have identical sets of combinations. So you can take this and self-join it to compare the sets, but I am not sure what else you can do - you can't order it by the sets.

Chris Saxon
November 08, 2017 - 11:10 am UTC

I'm not sure what you're trying to do with that query?

See the example below for a "no join" method. Though this assumes the OP doesn't need the cross join of the duplicates.

Is there simpler SQL than this that I came up with

Anand Ramanathan, November 07, 2017 - 7:03 pm UTC

I came up with a SQL that gets what I want but this does scale very well on the actual table with has millions of rows.

SQL>l
1 select distinct a.pi, b.pi as dup
2 from (select count(*) cnt, pi from t group by pi ) a,
3 ( select count(*) cnt, pi from t group by pi ) b,
4 t c, t d
5 where a.pi = c.pi
6 and b.pi = d.pi
7 and c.bu = d.bu
8 and c.pr = d.pr
9 and c.ac = d.ac
10 and c.pi != d.pi
11 and a.cnt = b.cnt
12* order by 1
SQL>/

PI DUP
---------- ----------
1001 1003
1001 1005
1003 1001
1003 1005
1005 1001
1005 1003

Thanks
Anand
Chris Saxon
November 08, 2017 - 11:09 am UTC

I don't understand why you exclude 1002 from your duplicate list? 1002 has the same values for bu, pr and ac as 1001, 1003 & 1005? Same for 1004.

Is the rule really:

- Find all the PI values that appear more than once
- For each of these, find any other duplicated PI values which have the same value for BU, PR & AC?

If so, you can do this with a single pass through the table. Use count(*) over() in a subquery to find the duplicated PIs. Then group the results of this to get the matching other rows:

with dup_pis as (
  select t.*, count(*) over ( partition by pi ) ct from t
)
  select bu, pr, ac, 
         listagg(pi, ',') within group (order by pi) pis
  from   dup_pis
  where  ct > 1
  group  by bu, pr, ac
  having count(*) > 1;

BU    PR    AC    PIS              
  100 PR1   AC1   1001,1003,1005   
  100 PR1   AC2   1001,1003,1005  


This is a single pass through the table (instead of 4). So I'd expect this to be notably faster.

If you need the results as separate rows, use your favourite "csv to rows" method. You can find some at:

https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:9526081800346942316

Thanks

A reader, November 08, 2017 - 11:36 am UTC

I will test this against the actual table to see if it works fine for us.

Btw, I am skipping 1002 because it has only processed (100,PR1,AC1) and not (100,AB1,CD2) and 1004 because It has only processed (100,PR1,AC2) and not (100,PR1,AC1) and these are developer requirements.

Your SQL works great for this.

Thanks again
Anand
Chris Saxon
November 08, 2017 - 3:36 pm UTC

So you want to find all the PI entries which contain rows with exactly the same values for BU, PR & AC as the one that contains the "most" entries?

e.g. if you insert

(1001, 100,AB1,EF3)

then there are no duplicates anymore?

Requirement is different

Anand Ramanathan, November 08, 2017 - 2:21 pm UTC

Hi Chris,
On review of your last sql, you are listing the PIs which have processed a particular group of bu,pr,ac. I need PIs which processed the same groups of bu, pr, ac.

I think I have what I need now. thanks for your time.

Anand

SQL query to find same column name from different tables.

Vishnu, September 24, 2019 - 6:14 am UTC

Do we have any stored procedure or query statement to find column name available in all tables in oracle DB.
Ex: Column 'Policy number'present in 'x' no. of Tables (table names) in oracle DB.

Appreciate your response!!!
Chris Saxon
September 24, 2019 - 11:01 am UTC

You can query the *_tab_cols views to find tables with a given column in the database.

But I'm not sure what this has to do with the original question?

Meaning of original question

Stew Ashton, January 18, 2021 - 11:06 am UTC

This question was automatically tweeted today and caught my attention.

I think I understand what the original requirement was and will try to express it.
- Each value of PI identifies a set of BU / PR / AC values. For example, PI=1001 identifies two rows with values 100 / 'PR1' / 'AC1' and 100 / 'PR1' / 'AC2'.
- Two sets are "duplicates" if they have the same number of rows and every row in one set matches a row in the other set on BU / PR / AC.

This means that the comparison has to be made between sets instead of between rows.

With more recent versions of the Oracle Database (from 12.2 on), JSON functions seem the most efficient way of pulling the data together for comparison.
select listagg(pi, ',') within group(order by pi) pi_duplicates,
  pi_data
from (
  select pi,
    json_arrayagg(json_array(bu,pr,ac null on null) order by bu,pr,ac) pi_data
  from t
  group by pi
)
group by pi_data
having count(*) > 1;

PI_DUPLICATES  PI_DATA
1001,1003,1005 [[100,"PR1","AC1"],[100,"PR1","AC2"]]

Best regards, Stew

More to Explore

SQL

The Oracle documentation contains a complete SQL reference.