You Asked

1. We receive data in the following format in a collection. We do not have control on this data.

(1, 'abc')
(1, 'def')
(1, 'pqr')
(2, 'xyz')
(2, 'abc')
(3, 'mno')
(4, 'abc')
(4, 'def')
(4, 'pqr')
(4, 'jkl')

2. rec_tab is the intermediate data. We derive the key (k1, k2, k3 etc) based on some logic. The structure is in the form of <record_num, key, value>.

(1, 'k1', 'abc')
(1, 'k2', 'def')
(1, 'k3', 'pqr')
(2, 'k1', 'xyz')
(2, 'k4', 'abc')
(3, 'k2', 'mno')
(4, 'k1', 'abc')
(4, 'k2', 'def')
(4, 'k3', 'pqr')
(4, 'k5', 'jkl')

3. I have to combine the intermediate data based on record_num in the below format. Also based on key value we have to filter records.
e.g. For record_num 4 we do not want to process k5, so (4, 'k5', 'jkl') will be filtered.

(1, 'k1|abc:k2|def:k3|pqr')
(2, 'k1|xyz:k4|abc')
(3, 'k2|mno')
(4, 'k1|abc:k2|def:k3|pqr')

4. Then I have to collect the duplicate record_num. It will be sufficient if we can say duplicate record exists even if we do not identify the duplicate record number.
e.g. in the above case it should return 1 and 4

I want to determine the duplicate record_num while deriving the key. Please suggest. We have control on data type from step 2 onwards.
The below approach starts from step 2:

CREATE OR REPLACE TYPE my_rec AS OBJECT (record_num NUMBER, key VARCHAR2(10), value VARCHAR2(50));
/

CREATE OR REPLACE TYPE rec_tab IS TABLE OF my_rec;
/

CREATE OR REPLACE TYPE dest_rec AS OBJECT (record_num NUMBER, key_val VARCHAR2(100));
/

CREATE OR REPLACE TYPE dest_tab IS TABLE OF dest_rec;
/

declare
lv_rec rec_tab := rec_tab();
lv_dest_tab dest_tab := dest_tab();
lv_tmp_tab dest_tab := dest_tab();
begin
lv_rec.extend;
lv_rec(lv_rec.last) := my_rec(1, 'k1', 'abc');
lv_rec.extend;
lv_rec(lv_rec.last) := my_rec(1, 'k2', 'def');
lv_rec.extend;
lv_rec(lv_rec.last) := my_rec(1, 'k3', 'pqr');
lv_rec.extend;
lv_rec(lv_rec.last) := my_rec(2, 'k1', 'xyz');
lv_rec.extend;
lv_rec(lv_rec.last) := my_rec(2, 'k4', 'abc');
lv_rec.extend;
lv_rec(lv_rec.last) := my_rec(3, 'k2', 'mno');
lv_rec.extend;
lv_rec(lv_rec.last) := my_rec(4, 'k1', 'abc');
lv_rec.extend;
lv_rec(lv_rec.last) := my_rec(4, 'k2', 'def');
lv_rec.extend;
lv_rec(lv_rec.last) := my_rec(4, 'k3', 'pqr');

SELECT dest_rec(record_num, key_val1)
BULK COLLECT INTO lv_dest_tab
FROM
(SELECT record_num, key_val1, COUNT(1) OVER (PARTITION BY key_val1) cnt
FROM (
SELECT record_num, LISTAGG(key_val, ':') WITHIN GROUP (ORDER BY key_val) key_val1
FROM (SELECT record_num, key||'|'||value as key_val FROM (SELECT * FROM TABLE(lv_rec)))
GROUP BY record_num )
) WHERE cnt > 1;

FOR i IN lv_dest_tab.FIRST .. lv_dest_tab.LAST
LOOP
dbms_output.put_line(lv_dest_tab(i).record_num || ', ' || lv_dest_tab(i).key_val);
END LOOP;

end;
/

and Connor said...

Depending on the volume of the incoming data it *might* be quicker to do it all in PLSQL using associative arrays, eg

SQL> set serverout on
SQL> declare
  2    type t1 is table of varchar2(4000) index by pls_integer;
  3    r1 t1;
  4
  5    type t2 is table of varchar2(100) index by varchar2(4000);
  6    r2 t2;
  7
  8    idx pls_integer;
  9
 10    lv_rec rec_tab := rec_tab();
 11    lv_dest_tab dest_tab := dest_tab();
 12    lv_tmp_tab dest_tab := dest_tab();
 13  begin
 14    lv_rec.extend;
 15    lv_rec(lv_rec.last) := my_rec(1, 'k1', 'abc');
 16    lv_rec.extend;
 17    lv_rec(lv_rec.last) := my_rec(1, 'k2', 'def');
 18    lv_rec.extend;
 19    lv_rec(lv_rec.last) := my_rec(1, 'k3', 'pqr');
 20    lv_rec.extend;
 21    lv_rec(lv_rec.last) := my_rec(2, 'k1', 'xyz');
 22    lv_rec.extend;
 23    lv_rec(lv_rec.last) := my_rec(2, 'k4', 'abc');
 24    lv_rec.extend;
 25    lv_rec(lv_rec.last) := my_rec(3, 'k2', 'mno');
 26    lv_rec.extend;
 27    lv_rec(lv_rec.last) := my_rec(4, 'k1', 'abc');
 28    lv_rec.extend;
 29    lv_rec(lv_rec.last) := my_rec(4, 'k2', 'def');
 30    lv_rec.extend;
 31    lv_rec(lv_rec.last) := my_rec(4, 'k3', 'pqr');
 32
 33  --SELECT dest_rec(record_num, key_val1)
 34  --BULK COLLECT INTO lv_dest_tab
 35  --FROM
 36  --
 37  --(SELECT record_num, key_val1, COUNT(1) OVER (PARTITION BY key_val1) cnt
 38  --FROM (
 39  --SELECT record_num, LISTAGG(key_val, ':') WITHIN GROUP (ORDER BY key_val) key_val1
 40  --FROM (SELECT record_num, key||'|'||value as key_val FROM (SELECT * FROM TABLE(lv_rec)))
 41  --GROUP BY record_num )
 42  --) WHERE cnt > 1;
 43
 44
 45  --FOR i IN lv_dest_tab.FIRST .. lv_dest_tab.LAST
 46  --LOOP
 47  --dbms_output.put_line(lv_dest_tab(i).record_num || ', ' || lv_dest_tab(i).key_val);
 48  --END LOOP;
 49
 50
 51  for i in 1 .. lv_Rec.count loop
 52    if r1.exists(lv_Rec(i).record_num) then
 53      r1(lv_Rec(i).record_num) :=   r1(lv_Rec(i).record_num) || lv_Rec(i).key||'|'||lv_Rec(i).value;
 54    else
 55      r1(lv_Rec(i).record_num) :=   lv_Rec(i).key||'|'||lv_Rec(i).value;
 56    end if;
 57  end loop;
 58
 59  idx := r1.first;
 60  loop
 61    if r2.exists(r1(idx)) then
 62      dbms_output.put_line(idx||' dup with '||r2(r1(idx)));
 63    else
 64      r2(r1(idx)) := idx;
 65    end if;
 66    idx := r1.next(idx);
 67    exit when idx is null;
 68  end loop;
 69
 70  end;
 71  /
4 dup with 1

PL/SQL procedure successfully completed.

but that probably wont scale as well as the SQL option if you've got tens of thousands of rows.

Is this answer out of date? If it is, please let us know via a Comment

More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.

Breadcrumb

Question and Answer

You Asked

and Connor said...

More to Explore

Analytics

Combine collection content by key

Breadcrumb

Question and Answer

You Asked

and Connor said...

More to Explore

Analytics