Skip to Main Content
  • Questions
  • How to extract distinct group of associated records (not exactly hierarchical in nature)

Breadcrumb

Question and Answer

Chris Saxon

Thanks for the question, aks.

Asked: December 12, 2018 - 9:24 pm UTC

Last updated: January 07, 2019 - 6:05 pm UTC

Version: Oracle 12c

Viewed 1000+ times

You Asked

Hello All,

It would be greatly appreciated if someone can provide a way out on how to extract distinct group/array of associated (they are not of parent-child relationship or hierarchical) records as shown below. The same table has been created with sample dataset which you will find in the LiveSQL link below -

https://livesql.oracle.com/apex/livesql/s/hodwakb4rv2siftyzv1juvc4i


CREATE TABLE TEST_DATA
(
SRC_REC NUMBER,
TGT_REC NUMBER
);

INSERT INTO TEST_DATA (SRC_REC, TGT_REC) VALUES(100, 101);
INSERT INTO TEST_DATA (SRC_REC, TGT_REC) VALUES(100, 102);
INSERT INTO TEST_DATA (SRC_REC, TGT_REC) VALUES(103, 102);
INSERT INTO TEST_DATA (SRC_REC, TGT_REC) VALUES(103, 104);
INSERT INTO TEST_DATA (SRC_REC, TGT_REC) VALUES(102, 105);
INSERT INTO TEST_DATA (SRC_REC, TGT_REC) VALUES(108, 110);
INSERT INTO TEST_DATA (SRC_REC, TGT_REC) VALUES(107, 106);
INSERT INTO TEST_DATA (SRC_REC, TGT_REC) VALUES(109, 111);
INSERT INTO TEST_DATA (SRC_REC, TGT_REC) VALUES(111, 107);
COMMIT;



Array 1: 100, 101, 102, 103, 104, 105
Array 2: 106, 107, 111
Array 3: 108, 110


with LiveSQL Test Case:

and Chris said...

This is a form of graph traversal problem. Which is tricky in SQL.

But you can do it using connect by. Or recursive with if you prefer.

For each row, recursively join it to every other row any of the following are true:

- The source = target (and vice-versa)
- The source = source and they have different targets
- The target = target and they have different sources

Include the root row for the source. This will loop back on itself, so you need to add cycle detection.

select src_rec, tgt_rec, 
       connect_by_root src_rec rt
from   test_data
connect by nocycle
  prior src_rec = tgt_rec or
  prior tgt_rec = src_rec or
  ( prior src_rec = src_rec and prior tgt_rec <> tgt_rec ) or
  ( prior tgt_rec = tgt_rec and prior src_rec <> src_rec );

SRC_REC   TGT_REC   RT    
      100       101   100 
      100       102   100 
      103       102   100 
      103       104   100 
      102       105   100 
      102       105   100 
      103       102   100 
      103       104   100 
      100       102   100 
      100       101   100 
      103       102   100 
      103       104   100 
      102       105   100 
      102       105   100 
      103       102   100 
      103       104   100
... 


This will generate the subgraph for all rows. To convert this to lists:

- Unpivot the source and target columns
- Return the resulting distinct root and values
- Use listagg to get a comma-separated list of the subgraph for each root
- Return the distinct list of these CSVs

Which gives:

with tree ( src, tgt, rt ) as (
  select src_rec, tgt_rec, 
         connect_by_root src_rec rt
  from   test_data
  connect by nocycle
    prior src_rec = tgt_rec or
    prior tgt_rec = src_rec or
    ( prior src_rec = src_rec and prior tgt_rec <> tgt_rec ) or
    ( prior tgt_rec = tgt_rec and prior src_rec <> src_rec )
), vals as ( 
  select distinct rt, val
  from   tree
  unpivot ( val for col in ( src, tgt ) )
)
  select distinct listagg ( val , ',' ) 
            within group ( order by val ) grp
  from   vals
  group  by rt;

GRP                       
106,107,109,111           
108,110                   
100,101,102,103,104,105   


Warning: you're generating the tree for every row. So on large data sets this likely to take a while!

Rating

  (1 rating)

Is this answer out of date? If it is, please let us know via a Comment

Comments

'tis the season

Racer I., January 07, 2019 - 10:48 am UTC

Hi,

A version with additional infrastructure which may help if the CONNECT BY runs into memory problems. At the cost of storage/runtime of course.
The outer loop picks an (unassociated so far) node at random and the inner grows a "snowflake" of connected nodes outward from it.
The outermost rims may be quite expensive. Alternatively the inner loop may have many (but fast) iterations for slim but long chains of nodes.
All accesses via index.
I wonder what the spatial/graph option of Oracle brings to the table for problems like this.

drop table graph_edge;

drop table graph_node;

create table graph_edge (id_from NUMBER, id_to NUMBER);

create index idx_gef on graph_edge (id_from);

create index idx_get on graph_edge (id_to);

create table graph_node (id NUMBER, subgroup NUMBER, gdepth NUMBER);

create index idx_gni on graph_node (id);

create index idx_gns on graph_node (subgroup, gdepth);

INSERT INTO graph_edge (id_from, id_to) VALUES(100, 101);
INSERT INTO graph_edge (id_from, id_to) VALUES(100, 102);
INSERT INTO graph_edge (id_from, id_to) VALUES(103, 102);
INSERT INTO graph_edge (id_from, id_to) VALUES(103, 104);
INSERT INTO graph_edge (id_from, id_to) VALUES(102, 105);
INSERT INTO graph_edge (id_from, id_to) VALUES(108, 110);
INSERT INTO graph_edge (id_from, id_to) VALUES(107, 106);
INSERT INTO graph_edge (id_from, id_to) VALUES(109, 111);
INSERT INTO graph_edge (id_from, id_to) VALUES(111, 107);

COMMIT;

insert into graph_node (id, subgroup, gdepth) (SELECT id_from, 0, 1 from graph_edge union SELECT id_to, 0, 1 from graph_edge);

COMMIT;

select * from graph_node;

declare
  vID NUMBER;
  vSG NUMBER := 0;
  vGD NUMBER;
begin
  LOOP
    SELECT NVL(MIN(ID), 0) INTO vID FROM graph_node WHERE subgroup = 0 AND ROWNUM <= 1;
    EXIT WHEN vID = 0;
    vSG := vSG + 1;
    UPDATE graph_node SET subgroup = vSG WHERE ID = vID;
    COMMIT;
    vGD := 1;
    LOOP
      UPDATE graph_node SET subgroup = vSG, gdepth = vGD + 1
      WHERE id IN (
        SELECT gnt.id
        FROM   graph_node gns
          JOIN graph_edge ge ON (gns.id = ge.id_from)
            JOIN graph_node gnt ON (ge.id_to = gnt.id)
        WHERE  gns.subgroup = vSG
        AND    gns.gdepth = vGD
        AND    gnt.subgroup = 0
        UNION
        SELECT gnt.id
        FROM   graph_node gns
          JOIN graph_edge ge ON (gns.id = ge.id_to)
            JOIN graph_node gnt ON (ge.id_from = gnt.id)
        WHERE  gns.subgroup = vSG
        AND    gns.gdepth = vGD
        AND    gnt.subgroup = 0);
      EXIT WHEN SQL%ROWCOUNT = 0;
      vGD := vGD + 1;
    END LOOP;
  END LOOP;
end;
/

select listagg(ID, ',') within group (order by id) grp
from graph_node
group by subgroup
order by subgroup;


GRP
100,101,102,103,104,105
106,107,109,111
108,110

regards,

Chris Saxon
January 07, 2019 - 6:05 pm UTC

Thanks for sharing.

Yes, this does look like a problem that PGQL (Property Graph Query Language) in Spatial is well suited for:

https://blogs.oracle.com/oraclespatial/property-graph-query-language-pgql-support-has-been-added-to-oracle-database-12201

Maybe someone will be able to share a solution using it :)

More to Explore

SQL

The Oracle documentation contains a complete SQL reference.