I need to track the path of documents that get rolled into each other. To do this, I built a hierarchy tree using a connect by prior query. I also rewrote the query as recursive CTE trying to see if I could get better performance. My table has about 190 million records and out of those I will need to track (start with) about 110 million.
TEST TABLE AND DATA:CREATE TABLE TEST_TABLE
(
TABLE_KEY INTEGER
, DOC1 VARCHAR2(39)
, DOC2 VARCHAR2(39)
, DOC1_KEY VARCHAR2(50)
, DOC2_KEY VARCHAR2(50)
, LOC INTEGER
, TXN_DATE DATE
);
commit;
INSERT INTO TEST_TABLE VALUES (1, 'DOC123', 'DOC123', 'DOC1230', 'DOC1230', 111, '18-JAN-2008 13:37:00');
INSERT INTO TEST_TABLE VALUES (2, 'TM123', 'DOC123', 'TM1230111', 'DOC1230', 111, '26-AUG-2008 20:41:29');
INSERT INTO TEST_TABLE VALUES (3, 'TM123', 'TM123', 'TM1230222', 'TM1230111', 222, '29-AUG-2008 09:57:22');
INSERT INTO TEST_TABLE VALUES (4, 'TM123', 'TM123', 'TM1230333', 'TM1230111', 333, '31-MAY-2011 08:46:06');
INSERT INTO TEST_TABLE VALUES (5, 'TM123', 'TM123', 'TM1230444', 'TM1230111', 444, '15-NOV-2011 10:43:44');
INSERT INTO TEST_TABLE VALUES (6, 'FIN123', 'TM123', 'FIN1230', 'TM1230222', 555, '09-APR-2015 07:49:21');
INSERT INTO TEST_TABLE VALUES (7, 'FIN456', 'TM123', 'FIN4560', 'TM1230111', 111, '20-MAY-2015 15:12:59');
INSERT INTO TEST_TABLE VALUES (8, 'FIN222', 'TM123', 'FIN2220', 'TM1230222', 222, '25-MAY-2015 15:12:59');
INSERT INTO TEST_TABLE VALUES (9, 'FIN222', 'TM499', 'FIN2220', 'TM4991111', 222, '27-FEB-2009 16:42:39');
INSERT INTO TEST_TABLE VALUES (10, 'DOC456', 'DOC456', 'DOC4561', 'DOC4561', 111, '20-APR-2012 09:21:59');
INSERT INTO TEST_TABLE VALUES (11, 'TM499', 'DOC456', 'TM4991111', 'DOC4561', 111, '20-APR-2012 09:22:10');
INSERT INTO TEST_TABLE VALUES (12, 'FIN456', 'TM499', 'FIN4560', 'TM4991111', 111, '04-JUN-2012 23:44:44');
commit;
QUERY #1 (connect by prior):select /*+ parallel 8 */
DOC,
max(FIN_DOC) keep (dense_rank first order by rownum ) FIN_DOC,
max(LOC) keep (dense_rank first order by rownum )LOC,
max(TXN_DATE) keep (dense_rank first order by rownum ) TXN_DATE,
max(PATH) keep (dense_rank first order by rownum ) PATH,
max(PATHLEN) keep (dense_rank first order by rownum ) PATHLEN
from
(
SELECT
CONNECT_BY_ROOT DOC2 as DOC,
DOC1 as FIN_DOC,
LOC,
TXN_DATE,
trim(SYS_CONNECT_BY_PATH(DOC1, ' ->')) as PATH,
LEVEL-1 as PATHLEN
FROM TEST_TABLE
WHERE CONNECT_BY_ISLEAF = 1
START WITH DOC2 in
(
select DOC2
from TEST_TABLE
WHERE DOC2 LIKE 'DOC%'
)
CONNECT BY NOCYCLE
PRIOR DOC1_key = DOC2_KEY
AND PRIOR TXN_DATE < TXN_DATE
ORDER SIBLINGS BY TXN_DATE
)
GROUP BY DOC;
QUERY #2 (recursive CTE):WITH RECURSIVE_CTE_QUERY(DOC2_KEY, DOC1_KEY, ROOT_DOC2, DOC1, LOC, TXN_DATE, PATH, PATHLEN)
AS
(
SELECT sh.DOC2_KEY, sh.DOC1_KEY, sh.DOC2 as ROOT_DOC2, sh.DOC1, sh.LOC, sh.TXN_DATE,
('->' || sh.DOC1) as PATH,
0 as PATHLEN
FROM TEST_TABLE sh
WHERE DOC2 LIKE 'DOC%'
UNION ALL
SELECT sh2.DOC2_KEY, sh2.DOC1_KEY, rq.ROOT_DOC2, sh2.DOC1, sh2.LOC, sh2.TXN_DATE,
(rq.PATH || ' ->' || sh2.DOC1) as PATH,
(rq.PATHLEN + 1) as PATHLEN
FROM TEST_TABLE sh2
JOIN RECURSIVE_CTE_QUERY rq
ON rq.DOC1_KEY = sh2.DOC2_KEY
AND rq.TXN_DATE < sh2.TXN_DATE
) SEARCH DEPTH FIRST BY TXN_DATE DESC SET SEQ
select /*+ parallel 8 */
ROOT_DOC2 as DOC,
max(DOC1) keep (dense_rank last order by SEQ) FIN_DOC,
max(LOC) keep (dense_rank last order by SEQ) LOC,
max(TXN_DATE) keep (dense_rank last order by SEQ) TXN_DATE,
max(PATH) keep (dense_rank last order by SEQ) PATH,
max(PATHLEN) keep (dense_rank last order by SEQ) PATHLEN
from RECURSIVE_CTE_QUERY
GROUP BY ROOT_DOC2;
What I need to do is start with certain types of documents and then track their progression and give the final document that the original document ended up getting rolled into. The same documents can get rolled into other documents multiple times so its important to track just the first path of each. So from the test data above, here are the results I should get (both queries above give the correct results):
DOC FIN_DOC LOC TXN_DATE PATH PATHLEN
------- -------- ----- ----------------- ---------------------------------- --------
DOC123 FIN123 555 4/9/15 07:49:21 ->DOC123 ->TM123 ->TM123 ->FIN123 3
DOC456 FIN456 111 6/4/12 23:44:44 ->DOC456 ->TM499 ->FIN456 2
So essentially I'm tracking the path of each document to its first leaf. I have indexes on DOC1, DOC2, DOC1_KEY, DOC2_KEY, TXN_DATE and a bitmap index on LOC. But with so much data, the only index that I see being used is the one of DOC2 (in the sub-query where I'm trying to find which documents to start with).
I have also tried the row_number() over method to take the first leaf after building the tree, but I think using KEEP (DENSE_RANK...) will use less memory and hopefully be faster.
Is there more efficient/simple way of achieving the same results? Or any other performance tweaks that can be made to the query or table used?
Thanks in advance for the help!
If you're starting with 110M rows, this query is going to take some time whatever you do!
An index over:
(doc2_key, txn_date)
May help with the join in your recursive query.
Another approach is for you to precompute the path, it's length and the root document. Then store these as extra columns on the table. This means you only need to scan the table once. If there's relatively few leaf rows you could add an "is_leaf" column and index it, which may help further.
The downside is you'll need to redo these calculations every time you change the table. If your data changes rarely and you query often, this may be a better approach.
But if you often add/remove rows anywhere other than the bottom of the hierarchy this overhead may not justify the query benefits.
If you need further help with the query as it stands, please post the execution plan for the query against the real tables.