Skip to Main Content
  • Questions
  • Performance of oracle hierarchical “connect by prior” or recursive CTE query

Breadcrumb

Question and Answer

Connor McDonald

Thanks for the question, Mo.

Asked: April 14, 2016 - 7:33 pm UTC

Last updated: October 17, 2016 - 12:40 am UTC

Version: 11gR2

Viewed 10K+ times! This question is

You Asked

I need to track the path of documents that get rolled into each other. To do this, I built a hierarchy tree using a connect by prior query. I also rewrote the query as recursive CTE trying to see if I could get better performance. My table has about 190 million records and out of those I will need to track (start with) about 110 million.

TEST TABLE AND DATA:
CREATE TABLE TEST_TABLE
(
    TABLE_KEY INTEGER
  , DOC1 VARCHAR2(39)
  , DOC2 VARCHAR2(39)
  , DOC1_KEY VARCHAR2(50)
  , DOC2_KEY VARCHAR2(50)
  , LOC INTEGER
  , TXN_DATE DATE
);
commit; 

INSERT INTO TEST_TABLE VALUES (1, 'DOC123', 'DOC123', 'DOC1230', 'DOC1230', 111, '18-JAN-2008 13:37:00');
INSERT INTO TEST_TABLE VALUES (2, 'TM123', 'DOC123', 'TM1230111', 'DOC1230', 111, '26-AUG-2008 20:41:29');
INSERT INTO TEST_TABLE VALUES (3, 'TM123', 'TM123', 'TM1230222', 'TM1230111', 222, '29-AUG-2008 09:57:22');
INSERT INTO TEST_TABLE VALUES (4, 'TM123', 'TM123', 'TM1230333', 'TM1230111', 333, '31-MAY-2011 08:46:06');
INSERT INTO TEST_TABLE VALUES (5, 'TM123', 'TM123', 'TM1230444', 'TM1230111', 444, '15-NOV-2011 10:43:44');
INSERT INTO TEST_TABLE VALUES (6, 'FIN123', 'TM123', 'FIN1230', 'TM1230222', 555, '09-APR-2015 07:49:21');
INSERT INTO TEST_TABLE VALUES (7, 'FIN456', 'TM123', 'FIN4560', 'TM1230111', 111, '20-MAY-2015 15:12:59');
INSERT INTO TEST_TABLE VALUES (8, 'FIN222', 'TM123', 'FIN2220', 'TM1230222', 222, '25-MAY-2015 15:12:59');

INSERT INTO TEST_TABLE VALUES (9, 'FIN222', 'TM499', 'FIN2220', 'TM4991111', 222, '27-FEB-2009 16:42:39');
INSERT INTO TEST_TABLE VALUES (10, 'DOC456', 'DOC456', 'DOC4561', 'DOC4561', 111, '20-APR-2012 09:21:59');
INSERT INTO TEST_TABLE VALUES (11, 'TM499', 'DOC456', 'TM4991111', 'DOC4561', 111, '20-APR-2012 09:22:10');
INSERT INTO TEST_TABLE VALUES (12, 'FIN456', 'TM499', 'FIN4560', 'TM4991111', 111, '04-JUN-2012 23:44:44');
commit;


QUERY #1 (connect by prior):

select /*+ parallel 8 */ 
  DOC,
  max(FIN_DOC) keep (dense_rank first order by rownum ) FIN_DOC,
  max(LOC) keep (dense_rank first order by rownum )LOC,
  max(TXN_DATE) keep (dense_rank first order by rownum ) TXN_DATE,
  max(PATH) keep (dense_rank first order by rownum ) PATH,
  max(PATHLEN) keep (dense_rank first order by rownum ) PATHLEN
from 
(
  SELECT 
    CONNECT_BY_ROOT DOC2 as DOC,
    DOC1 as FIN_DOC,
    LOC,
    TXN_DATE,
    trim(SYS_CONNECT_BY_PATH(DOC1, ' ->')) as PATH,
    LEVEL-1 as PATHLEN
  FROM TEST_TABLE 
  WHERE CONNECT_BY_ISLEAF = 1
  START WITH DOC2 in 
  (
    select DOC2
    from TEST_TABLE 
    WHERE DOC2 LIKE 'DOC%'
  )
  CONNECT BY NOCYCLE
    PRIOR DOC1_key = DOC2_KEY 
    AND PRIOR TXN_DATE < TXN_DATE 
  ORDER SIBLINGS BY TXN_DATE
)
GROUP BY DOC;


QUERY #2 (recursive CTE):

WITH RECURSIVE_CTE_QUERY(DOC2_KEY, DOC1_KEY, ROOT_DOC2, DOC1, LOC, TXN_DATE, PATH, PATHLEN) 
AS
(
  SELECT sh.DOC2_KEY,  sh.DOC1_KEY, sh.DOC2 as ROOT_DOC2, sh.DOC1, sh.LOC, sh.TXN_DATE,  
         ('->' || sh.DOC1) as PATH, 
         0 as PATHLEN
  FROM TEST_TABLE sh
  WHERE DOC2 LIKE 'DOC%'

  UNION ALL
  
  SELECT sh2.DOC2_KEY, sh2.DOC1_KEY, rq.ROOT_DOC2, sh2.DOC1, sh2.LOC, sh2.TXN_DATE, 
         (rq.PATH || ' ->' || sh2.DOC1) as PATH,  
         (rq.PATHLEN + 1) as PATHLEN
  FROM TEST_TABLE sh2 
  JOIN RECURSIVE_CTE_QUERY rq
    ON rq.DOC1_KEY = sh2.DOC2_KEY
   AND rq.TXN_DATE < sh2.TXN_DATE
)  SEARCH DEPTH FIRST BY TXN_DATE DESC SET SEQ
    
select /*+ parallel 8 */ 
  ROOT_DOC2 as DOC,
  max(DOC1) keep (dense_rank last order by SEQ) FIN_DOC,
  max(LOC) keep (dense_rank last order by SEQ) LOC,
  max(TXN_DATE) keep (dense_rank last order by SEQ) TXN_DATE,
  max(PATH) keep (dense_rank last order by SEQ) PATH,
  max(PATHLEN) keep (dense_rank last order by SEQ) PATHLEN
from RECURSIVE_CTE_QUERY
GROUP BY ROOT_DOC2;


What I need to do is start with certain types of documents and then track their progression and give the final document that the original document ended up getting rolled into. The same documents can get rolled into other documents multiple times so its important to track just the first path of each. So from the test data above, here are the results I should get (both queries above give the correct results):

DOC     FIN_DOC   LOC  TXN_DATE          PATH                               PATHLEN
------- -------- ----- ----------------- ---------------------------------- --------
DOC123  FIN123    555  4/9/15 07:49:21   ->DOC123 ->TM123 ->TM123 ->FIN123       3 
DOC456  FIN456    111  6/4/12 23:44:44   ->DOC456 ->TM499 ->FIN456               2 


So essentially I'm tracking the path of each document to its first leaf. I have indexes on DOC1, DOC2, DOC1_KEY, DOC2_KEY, TXN_DATE and a bitmap index on LOC. But with so much data, the only index that I see being used is the one of DOC2 (in the sub-query where I'm trying to find which documents to start with).

I have also tried the row_number() over method to take the first leaf after building the tree, but I think using KEEP (DENSE_RANK...) will use less memory and hopefully be faster.

Is there more efficient/simple way of achieving the same results? Or any other performance tweaks that can be made to the query or table used?

Thanks in advance for the help!

and Chris said...

If you're starting with 110M rows, this query is going to take some time whatever you do!

An index over:

(doc2_key, txn_date)

May help with the join in your recursive query.

Another approach is for you to precompute the path, it's length and the root document. Then store these as extra columns on the table. This means you only need to scan the table once. If there's relatively few leaf rows you could add an "is_leaf" column and index it, which may help further.

The downside is you'll need to redo these calculations every time you change the table. If your data changes rarely and you query often, this may be a better approach.

But if you often add/remove rows anywhere other than the bottom of the hierarchy this overhead may not justify the query benefits.

If you need further help with the query as it stands, please post the execution plan for the query against the real tables.

Rating

  (3 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Hierarchical Queries Performance

Sera, October 14, 2016 - 2:08 pm UTC

Tom
The response is very insightful. for the sake of argument, let us say the data undergoes many inserts / updates. Functionally, the requirement for the select query remains the same (though nobody cared how exactly the functionality is met). what are the best options? XML-Table??
Chris Saxon
October 14, 2016 - 3:22 pm UTC

It's Chris actually ;)

what are the best options?

Best options for what? What are you trying to do? What has XMLTable got to do with anything?

How about getting rid of the other siblings?

Stew Ashton, October 15, 2016 - 9:00 am UTC

Cary Millsap is known for saying "the most efficient way of doing something is not to do it" if you don't need to.

Why start with all the documents where DOC2 starts with 'DOC'? You're going to throw away the results for half of them. Why not start from this:
select table_key, doc1, doc2, doc1_key, doc2_key, loc, txn_date
from (
  select a.*,
  row_number() over (partition by doc2 order by txn_date) rn
  from test_table a 
  where doc2 like 'DOC%'
)
where rn = 1;

 TABLE_KEY DOC1    DOC2    DOC1_KEY  DOC2_KEY         LOC TXN_DATE           
---------- ------- ------- --------- --------- ---------- --------------------
         1 DOC123  DOC123  DOC1230   DOC1230          111 18-jan-2008 13:37:00
        10 DOC456  DOC456  DOC4561   DOC4561          111 20-apr-2012 09:21:59

Next, why connect to child documents that have older siblings? How about just keeping the oldest children?
select distinct
  table_key, doc1, doc2, doc1_key, doc2_key, loc, txn_date
from (
  select son.*,
  row_number() over(partition by dad.doc1_key order by son.TXN_DATE) rn
  from test_table dad
  join test_table son
    on dad.doc1_key = son.doc2_key and dad.TXN_DATE < son.TXN_DATE
)
where rn = 1;

 TABLE_KEY DOC1    DOC2    DOC1_KEY  DOC2_KEY         LOC TXN_DATE           
---------- ------- ------- --------- --------- ---------- --------------------
        12 FIN456  TM499   FIN4560   TM4991111        111 04-jun-2012 23:44:44
         6 FIN123  TM123   FIN1230   TM1230222        555 09-apr-2015 07:49:21
         2 TM123   DOC123  TM1230111 DOC1230          111 26-aug-2008 20:41:29
        11 TM499   DOC456  TM4991111 DOC4561          111 20-apr-2012 09:22:10
         3 TM123   TM123   TM1230222 TM1230111        222 29-aug-2008 09:57:22

Now, based on your provided test data, there is no need to GROUP BY any more. I have not tried to create nastier test data. If you come up with a test case where this doesn't work, please post it.
with root_txns as (
  select table_key, doc1, doc2, doc1_key, doc2_key, loc, txn_date
  from (
    select a.*,
    row_number() over (partition by doc2 order by txn_date) rn
    from test_table a 
    where doc2 like 'DOC%'
  )
  where rn = 1
)
, son_txns as (
  select distinct
    table_key, doc1, doc2, doc1_key, doc2_key, loc, txn_date
  from (
    select son.*,
    row_number() over(partition by dad.doc1_key order by son.TXN_DATE) rn
    from test_table dad
    join test_table son
      on dad.doc1_key = son.doc2_key and dad.TXN_DATE < son.TXN_DATE
  )
  where rn = 1
)
SELECT 
  CONNECT_BY_ROOT DOC2 as DOC,
  DOC1 as FIN_DOC,
  LOC,
  TXN_DATE,
  trim(SYS_CONNECT_BY_PATH(DOC1, ' ->')) as PATH,
  LEVEL-1 as PATHLEN
FROM (
  select * from root_txns union all select * from son_txns
)
WHERE CONNECT_BY_ISLEAF = 1
START WITH table_key in (select table_key from root_txns)
CONNECT BY
  PRIOR DOC1_key = DOC2_KEY 
  AND PRIOR TXN_DATE < TXN_DATE;

DOC    FIN_DOC LOC TXN_DATE             PATH                              PATHLEN
DOC123 FIN123  555 09-apr-2015 07:49:21 ->DOC123 ->TM123 ->TM123 ->FIN123 3
DOC456 FIN456  111 04-jun-2012 23:44:44 ->DOC456 ->TM499 ->FIN456         2

Oops!

Stew Ashton, October 15, 2016 - 3:10 pm UTC

I didn't see that this was an old question, and that the latest review was by someone other than the original poster.

Please ignore my previous effort.

Best regards, Stew
Connor McDonald
October 17, 2016 - 12:40 am UTC

Either way... its still good input.

Thanks Stew

More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.