Skip to Main Content
  • Questions
  • Hierarchical query with count of leave attributes

Breadcrumb

Question and Answer

Chris Saxon

Thanks for the question, Koen.

Asked: September 21, 2017 - 3:20 pm UTC

Last updated: October 30, 2017 - 5:49 pm UTC

Version: 12.1.0.2

Viewed 10K+ times! This question is

You Asked

Hello Experts. I want to calculate the sum of the count of the leaves' attributes in a hierarchical query
create table hq_test (parent_id NUMBER, child_id NUMBER);

INSERT INTO hq_test (parent_id, child_id)  VALUES (25,26);
INSERT INTO hq_test (parent_id, child_id)  VALUES (25,27); 
INSERT INTO hq_test (parent_id, child_id)  VALUES (26,28); 
INSERT INTO hq_test (parent_id, child_id)  VALUES (26,29); 
INSERT INTO hq_test (parent_id, child_id)  VALUES (26,30); 
INSERT INTO hq_test (parent_id, child_id)  VALUES (27,31); 
INSERT INTO hq_test (parent_id, child_id)  VALUES (27,32); 
INSERT INTO hq_test (parent_id, child_id)  VALUES (27,33); 
INSERT INTO hq_test (parent_id, child_id)  VALUES (30,34); 
INSERT INTO hq_test (parent_id, child_id)  VALUES (NULL,25);  

create table hq_test_attr (id NUMBER, val VARCHAR2(100));

INSERT INTO hq_test_attr (id, val) VALUES (26, 'A');
INSERT INTO hq_test_attr (id, val) VALUES (26, 'B');
INSERT INTO hq_test_attr (id, val) VALUES (27, 'C');
INSERT INTO hq_test_attr (id, val) VALUES (27, 'D');
INSERT INTO hq_test_attr (id, val) VALUES (28, 'E');
INSERT INTO hq_test_attr (id, val) VALUES (28, 'F');
INSERT INTO hq_test_attr (id, val) VALUES (28, 'G');
INSERT INTO hq_test_attr (id, val) VALUES (29, 'H');
INSERT INTO hq_test_attr (id, val) VALUES (30, 'I');
INSERT INTO hq_test_attr (id, val) VALUES (31, 'J');
INSERT INTO hq_test_attr (id, val) VALUES (31, 'K');
INSERT INTO hq_test_attr (id, val) VALUES (32, 'L');
INSERT INTO hq_test_attr (id, val) VALUES (33, 'M');
INSERT INTO hq_test_attr (id, val) VALUES (33, 'N');
INSERT INTO hq_test_attr (id, val) VALUES (33, 'O');


I want to get an output of each id with the count of its attributes and the total count of its leaves' attributes walking the entire hierarchy. For example: 33 has 3 attributes (M,N and O) but no leaves; 27 has an attribute count of 2 and a leaves attribute count of 6 (attribute counts of 31,32 and 33).
This can be achieved by scalar subqueries but that performs badly. Can this be done w/ sql ?

Rgds
Koen

and Chris said...

So you want the sum of all the attributes in the subtree below each node, including itself?

When dealing with trees, I find it's helpful to have a diagram showing what you want. So here we go:

TREE_WITH_ATTRS

The numbers inside the boxes are the node ID followed by the number of its attributes (in brackets).

The number above each box is the sum of all the attributes in the subtree below it.

But how do we calculate that?

First, assign each node a sequence number using depth-first search. Showing these on the left of each box in our diagram gives:

TREE_WITH_ATTRS_DEPTH

To get the total below a node, you need to add up the values from itself up to, but not including, the next node that is the same level or lower in the tree.

For example, node 26 is 2nd in the search. You need to add all the attributes between it and 27 which is 7th. So its total is everything between nodes 2 and 6.

That's the concept. Let's translate it into SQL.

First you need to calculate how many attributes each node has. This is a standard outer join (some nodes have no attributes) and group by:

  select h.parent_id, h.child_id, count(a.id) c
  from   hq_test h
  left   join hq_test_attr a
  on     h.child_id = a.id
  group  by h.parent_id, h.child_id;


Next you need to build the tree. I've done this using recursive with. This enables you to specify depth-first search (the default) and assign each node a sequence number in the order you access them:

tree (par, chd, c, tot, pth, lev) as (
  select parent_id, child_id, c, c,
         child_id || '',
         1 lev
  from   counts c
  where  parent_id is null
  union all
  select c.parent_id, c.child_id, c.c, 0 ,
         pth || '|' || c.child_id,
         lev + 1 lev
  from   tree t
  join   counts c
  on     t.chd = c.parent_id
) search depth first by chd set seq


Here's where it gets interesting. You want to "look forward" to the next node where the level <= the current level. You can find this using a subquery like:

select min(s.seq) from tree s
where  s.seq > t.seq 
and    s.lev <= t.lev


But this query will run for each node in the tree. Which may not scale well.

Luckily you're on 12c. So pattern matching can come to the rescue! :)

To use it, define the subtree variable as rows where the level is greater than that of the first row in the match:

pattern ( strt subtree* )
define 
  subtree as lev > first (lev)


You need to do this for every row in the table. But by default match_recognize, after it completes a match, goes to the next row in the data set. This means it'll start at the root, traverse the tree once then stop!

To fix this you need to "go back" to the second row. And again for the third, forth, etc. To do this, add:

after match skip to next row


This enables it to find all the subtrees for each row.

To finish all, you need to do is define the columns you want in your output in the measures clause. To get the total attributes, sum up all the counts. For anything else you want to see, such as the parent, child, etc. return the first() of each column.

Put it all together and you get:

with counts as (
  select h.parent_id, h.child_id, count(a.id) c
  from   hq_test h
  left   join hq_test_attr a
  on     h.child_id = a.id
  group  by h.parent_id, h.child_id
), tree (par, chd, c, tot, pth, lev) as (
  select parent_id, child_id, c, c,
         child_id || '',
         1 lev
  from   counts c
  where  parent_id is null
  union all
  select c.parent_id, c.child_id, c.c, 0 ,
         pth || '|' || c.child_id,
         lev + 1 lev
  from   tree t
  join   counts c
  on     t.chd = c.parent_id
) search depth first by chd set seq
  select * from tree
  match_recognize (
    order by seq
    measures
      first(chd) as child_id,
      first(par) as parent_id,
      first(pth) as pth,
      first (lev) as lev,
      first(c) as attrs,
      sum(c) as subtot,
      first(seq) as seq
    after match skip to next row
    pattern ( strt subtree* )
    define 
      subtree as lev > first (lev)
  )
  order  by seq;

CHILD_ID   PARENT_ID   PTH           LEV   ATTRS   SUBTOT   SEQ   
        25             25                1       0       15     1 
        26          25 25|26             2       2        7     2 
        28          26 25|26|28          3       3        3     3 
        29          26 25|26|29          3       1        1     4 
        30          26 25|26|30          3       1        1     5 
        34          30 25|26|30|34       4       0        0     6 
        27          25 25|27             2       2        8     7 
        31          27 25|27|31          3       2        2     8 
        32          27 25|27|32          3       1        1     9 
        33          27 25|27|33          3       3        3    10 

Rating

  (9 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

For Real? or for sport?

Duke Ganote, October 14, 2017 - 8:40 pm UTC

I always wonder if this is a practical problem, or just some instructor's playground. At any rate, Dave Ballantyne solved a variant of this nicely for the (SQL Server) SQL Challenge #19 of 2009.

My toy, Oracle-ized version follows:

WITH
-- -------------------------------------------------
-- normalize the list of nodes
node AS (-------------------------------------------
SELECT parent_id AS ID
  FROM hq_test
 WHERE parent_id IS NOT NULL
UNION
SELECT child_id AS ID
  FROM hq_test
 WHERE child_id IS NOT NULL
), -------------------------------------------------
-- count attributes per node
attributeCount AS ( --------------------------------
SELECT node.id
     , COALESCE(count(val),0) attrCnt
  FROM node
  LEFT OUTER
  JOIN hq_test_attr
    ON hq_test_attr.ID = node.ID
 GROUP BY node.id
), -------------------------------------------------
-- which nodes are leaves?
leafFinder AS ( ------------------------------------
SELECT child_ID AS leaf_ID
  FROM hq_test "child"
MINUS
SELECT parent_id
  FROM hq_test "parent"
), -------------------------------------------------
-- count leaf attributes
leafThoseKidsAlone AS (-----------------------------
SELECT node.id
     , COALESCE(attributeCount.AttrCnt,0) as attrCnt
     , CASE WHEN leafFinder.leaf_ID IS NOT NULL
            THEN attributeCount.AttrCnt
            ELSE 0
        END AS leafAttrCnt
  FROM node
  LEFT OUTER
  JOIN attributeCount
    ON node.ID = attributeCount.ID
  LEFT OUTER
  JOIN leafFinder
    ON leafFinder.leaf_id = attributeCount.ID
), -------------------------------------------------
-- find paths outward ("branch out")
----------------------------------------------------
BranchingOut ( id, attrCnt, leafAttrCnt, branch
             , lvl ) AS (
SELECT B.ID
     , COALESCE(L.attrCnt,0) AS attrCnt
     , COALESCE(L.leafAttrCnt,0) AS leafAttrCnt
     , ','||B.id AS branch
     , 1
  FROM node B
  JOIN hq_test
    ON B.id = hq_test.child_id
   AND hq_test.parent_ID IS NULL
  JOIN leafThoseKidsAlone L
    ON hq_test.child_id = L.id
UNION ALL
SELECT S.child_id AS ID
     , COALESCE(L.attrCnt,0) AS attrCnt
     , COALESCE(L.leafAttrCnt,0) AS leafAttrCnt
     , BranchingOut.branch||','||S.child_id
     , BranchingOut.lvl + 1
  FROM BranchingOut
  JOIN hq_test S
    ON BranchingOut.id = S.parent_id
  JOIN leafThoseKidsAlone L
    ON S.child_id = L.id
) --------------------------------------------------
-- wrap up and output
----------------------------------------------------
SELECT b.ID, max(b.lvl) as "Level"
     , SUM(b.attrCnt) AS attrCnt
     , SUM(C.leafAttrCnt) AS LeafAttrCnt
  FROM BranchingOut B
  JOIN BranchingOut C
    ON C.branch||',' LIKE '%,'||B.ID||'%'
 GROUP BY b.ID
 ORDER BY 1;

   ID      Level    ATTRCNT LEAFATTRCNT
----- ---------- ---------- -----------
   25          1          0          10
   26          2         10           4
   27          2          8           6
   28          3          3           3
   29          3          1           1
   30          3          2           0
   31          3          2           2
   32          3          1           1
   33          3          3           3
   34          4          0           0


The Wayback Machine has an archived version of the Challenge:
http://tinyurl.com/SQLchallenge19
Connor McDonald
October 15, 2017 - 1:19 pm UTC

Nice input, thanks.

Alternative

Racer I., October 16, 2017 - 7:50 am UTC

Hi,

I wanted to try with connect by :

WITH
AttCnt AS (
  select h.parent_id, h.child_id, 0 c
    from   hq_test h
    group  by h.parent_id, h.child_id
    UNION ALL
    select h.child_id parent_id, -h.child_id child_id, count(a.id) c
    from   hq_test h
      left join hq_test_attr a ON (h.child_id = a.id)
    group  by h.child_id),
CBS AS (    
  SELECT ac.c, CONNECT_BY_ROOT ac.parent_id pid
  FROM   AttCnt ac
  START WITH ac.parent_ID IS NOT NULL
  CONNECT BY ac.parent_id = PRIOR ac.child_id),
HV AS (  
  SELECT pid, SUM(c) val
  FROM CBS
  GROUP BY pid)
SELECT *
FROM   HV
order by pid


Had to turn the attributes into fictive leaf nodes (left joined because 34 has no attributes)

Chris Saxon
October 16, 2017 - 12:31 pm UTC

Nice work, thanks for sharing.

Alternative (corr.)

Racer I., October 16, 2017 - 8:01 am UTC

Hi,

Forgot the nodes own counts :

WITH
AttCnt AS (
  select h.parent_id, h.child_id, 0 c
  from   hq_test h
  UNION ALL
  select h.child_id parent_id, -h.child_id child_id, count(a.id) c
  from   hq_test h
    left join hq_test_attr a ON (h.child_id = a.id)
  group  by h.child_id),
CBS AS (    
  SELECT ac.*, CONNECT_BY_ROOT ac.parent_id pid
  FROM   AttCnt ac
  START WITH ac.parent_ID IS NOT NULL
  CONNECT BY ac.parent_id = PRIOR ac.child_id),
HV AS (  
  SELECT pid, SUM(c) val
  FROM CBS
  GROUP BY pid)
SELECT hv.pid child_id, ac.c AttCnt, hv.val TotalCnt 
FROM   HV
  join AttCnt ac ON (hv.pid = -ac.child_id)
order by pid

CHILD_ID ATTCNT TOTALCNT
25 0 15
26 2 7
27 2 8
28 3 3
29 1 1
30 1 1
31 2 2
32 1 1
33 3 3
34 0 0

Alternative II

Racer I., October 16, 2017 - 9:40 am UTC

Hi,

I only now noticed that the first answer uses recursive with (rw), which should replace connect by. I've never used this so I tried to make it work with that. Saves the extra join to get the nodes own values.
When you list just the rw-subquery it looks quite inefficient unfortunately (way too many rows). Which is probably the reason for the pattern matching in the first answer.

WITH
AttCnt AS (
  select h.parent_id, h.child_id, 0 c
  from   hq_test h
  UNION ALL
  select h.child_id parent_id, -h.child_id child_id, count(a.id) c
  from   hq_test h
    left join hq_test_attr a ON (h.child_id = a.id)
  group  by h.child_id),
RW (pid, parent_id, child_id, c) AS (    
  SELECT parent_id pid, parent_id, child_id, c
  FROM   AttCnt ac
  UNION ALL
  SELECT r.pid, ac.parent_id, ac.child_id, ac.c
  FROM   RW     r
    JOIN AttCnt ac ON (r.child_id = ac.parent_id))
SELECT pid, SUM(CASE WHEN pid=-child_id THEN c ELSE 0 END) AttCnt, SUM(c) TotalCnt
FROM RW
WHERE pid IS NOT NULL
GROUP BY pid
ORDER BY pid


Racer I and Razor Eye

Duke Ganote, October 17, 2017 - 2:13 pm UTC

I've admitted elsewhere that I'm a visual guy; raw SQL means little to me... I have to stare at it (poke and prod) until I tie it to a mental representation. So it was with Racer I's code.

The essential difference I see between Racer I's approach and Dave Ballantyne's is the "all-start" instead of the "root start" while branching outwardly.

Racer I ties each start node to all descendant leaves during the branching-out. Ballantyne concatenates the node-to-leaf paths during the branching-out, then shreds the path in order to sum the leaf-attribute counts for each start node.

I suspect Racer I's approach is more efficient. Certainly it eliminates the potential issue of overrunning the VARCHAR2 buffer while concatenating a very long path. All-in-all, I found his code a good lesson in recursion.

Here's my modification of my code incorporating Racer I's approach:

WITH
-- -------------------------------------------------
-- normalize the list of nodes
node AS (-------------------------------------------
SELECT parent_id AS ID
  FROM hq_test
 WHERE parent_id IS NOT NULL
UNION
SELECT child_id AS ID
  FROM hq_test
 WHERE child_id IS NOT NULL
), -------------------------------------------------
-- count attributes per node
attributeCount AS ( --------------------------------
SELECT node.id
     , COALESCE(count(val),0) attrCnt
  FROM node
  LEFT OUTER
  JOIN hq_test_attr
    ON hq_test_attr.ID = node.ID
 GROUP BY node.id
), -------------------------------------------------
-- which nodes are leaves?
leafFinder AS ( ------------------------------------
SELECT child_ID AS leaf_ID
  FROM hq_test "child"
MINUS
SELECT parent_id
  FROM hq_test "parent"
), -------------------------------------------------
-- count leaf attributes
leafThoseKidsAlone AS (-----------------------------
SELECT node.id
     , COALESCE(attributeCount.AttrCnt,0) as attrCnt
     , CASE WHEN leafFinder.leaf_ID IS NOT NULL
            THEN attributeCount.AttrCnt
            ELSE 0
        END AS leafAttrCnt
  FROM node
  LEFT OUTER
  JOIN attributeCount
    ON node.ID = attributeCount.ID
  LEFT OUTER
  JOIN leafFinder
    ON leafFinder.leaf_id = attributeCount.ID
), -------------------------------------------------
-- find paths outward ("branch out")
----------------------------------------------------
BranchingOut ( startID, id, attrCnt, leafAttrCnt, branch
             , StartLevel, Depth# ) AS (
SELECT B.ID AS startID
     , B.ID AS ID
     , COALESCE(L.attrCnt,0) AS attrCnt
     , COALESCE(L.leafAttrCnt,0) AS leafAttrCnt
     , ','||B.id AS branch
     , 1
     , CASE WHEN hq_test.parent_ID IS NULL 
            THEN 1 
        END as LVL
  FROM node B
  JOIN hq_test
    ON B.id = hq_test.child_id
--   AND hq_test.parent_ID IS NULL -- start everywhere!! <<<<<<<<<<<
  JOIN leafThoseKidsAlone L
    ON hq_test.child_id = L.id
UNION ALL
SELECT BranchingOut.startID
     , S.child_id AS ID
     , COALESCE(L.attrCnt,0) AS attrCnt
     , COALESCE(L.leafAttrCnt,0) AS leafAttrCnt
     , BranchingOut.branch||','||S.child_id
     , BranchingOut.StartLevel + 1 AS StartLevel
     , CASE WHEN BranchingOut.Depth# IS NOT NULL
            THEN BranchingOut.Depth# + 1
        END as Depth#
  FROM BranchingOut
  JOIN hq_test S
    ON BranchingOut.id = S.parent_id
  JOIN leafThoseKidsAlone L
    ON S.child_id = L.id 
) --------------------------------------------------
-- wrap up and output
----------------------------------------------------
select B.StartID, max(fromTop.Depth#) AS DEPTH#
     , sum(B.leafAttrCnt) leafAttrCnt
  FROM branchingOut B
  LEFT OUTER
  JOIN branchingOut fromTop
    ON B.startID = fromTop.ID
   AND fromTop.DEPTH# IS NOT NULL 
 GROUP BY B.startID
 order by 1, 2, 3;

STARTID     DEPTH# LEAFATTRCNT
------- ---------- -----------
     25          1          10
     26          2           4
     27          2           6
     28          3           3
     29          3           1
     30          3           0
     31          3           2
     32          3           1
     33          3           3
     34          4           0

Chris Saxon
October 17, 2017 - 3:33 pm UTC

If you compare the work done by match_recognize, Racer I and Ballantyne solutions you see:

match_recognize
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          6  consistent gets
          0  physical reads
          0  redo size
       1018  bytes sent via SQL*Net to client
        551  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          9  sorts (memory)
          0  sorts (disk)
         10  rows processed


Racer I
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          9  consistent gets
          0  physical reads
          0  redo size
        637  bytes sent via SQL*Net to client
        551  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          8  sorts (memory)
          0  sorts (disk)
         10  rows processed


Ballantyne
Statistics
----------------------------------------------------------
          2  recursive calls
         11  db block gets
         35  consistent gets
          1  physical reads
        616  redo size
        636  bytes sent via SQL*Net to client
        551  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
         12  sorts (memory)
          0  sorts (disk)
         10  rows processed


So yes. The Ballantyne method is doing a lot more work!

Though this will be at least partly due to the fact it accesses hq_test far more times than the other solutions. You may be able to make decent gains by changing the leafFinder and node subqueries so they only access it once ;)

Performance considerations

Duke Ganote, October 23, 2017 - 7:41 pm UTC

I'm not sure Ballantyne's approach does more work, at least for a reasonable data set; my implementation, as you suggested, may be simply inefficient for the small data set.

However, if we generate a larger data set, say, about 10,000 employees, like so:

CREATE TABLE emp2 AS
WITH
--------------------------------------------------------------------------------
-- Parameters (single record)
--------------------------------------------------------------------------------
parm AS (
SELECT 10000 AS NewEmployeeCnt   -- start with zero employees, then count
     ,     3 AS NewSubordinateCnt-- start with zero per manager, then count
     ,   100 AS MaxOrderCnt      -- 0..100 orders per employee
  FROM DUAL
),------------------------------------------------------------------------------
-- employees and hierarchy.        PK = employeeID
--------------------------------------------------------------------------------
emp_hier ( employeeID, EmployeeName, ReportsTo
         , NewSubordinateCnt, NewEmployeeCnt, MAXSubCnt
         , OrderCnt, MaxOrderCnt
         ) AS (
SELECT 1 AS employeeID
     , TO_CHAR( to_date(1,'J'),'Jsp') AS EmployeeName
     , 1 AS ReportsTo -- selfish at the top
     , parm.NewSubordinateCnt
     , parm.NewEmployeeCnt
     , parm.NewSubordinateCnt AS MAXSubCnt
     , ROUND(DBMS_RANDOM.VALUE*MaxOrderCnt,0) AS OrderCnt
     , MaxOrderCnt
  FROM parm
UNION ALL
SELECT emp_hier.employeeID + 1
     , TO_CHAR( to_date(emp_hier.employeeID + 1,'J'),'Jsp') AS EmployeeName
     , CASE WHEN NewSubordinateCnt > 1
            THEN emp_hier.ReportsTo
            ELSE emp_hier.ReportsTo + 1
                 + CASE WHEN DBMS_RANDOM.VALUE > 0.2
                        THEN 0 ELSE 1 -- 20% ragged
                    END
        END AS ReportsTo
     , CASE WHEN NewSubordinateCnt > 1
            THEN NewSubordinateCnt - 1
            ELSE ROUND(DBMS_RANDOM.VALUE*emp_hier.MAXSubCnt,0)
        END AS NewSubordinateCnt
     , NewEmployeeCnt - 1 AS NewEmployeeCnt
     , emp_hier.MAXSubCnt
     , ROUND(DBMS_RANDOM.VALUE*MaxOrderCnt,0) AS OrderCnt
     , MaxOrderCnt
  FROM emp_hier
 WHERE NewEmployeeCnt >= 1
)------------------------------------------------------------------------------
-- Pretty-up the results
--------------------------------------------------------------------------------
SELECT employeeID
     ,         SUBSTR(EmployeeName,1,1) -- AS FirstName (initial)
     ||'. '
     ||InitCap(SUBSTR(EmployeeName,2,999)) -- AS LastName
          AS EmployeeName
     , CASE WHEN ReportsTo = employeeID
            THEN NULL ELSE ReportsTo
        END AS ReportsTo
     , OrderCnt
  FROM emp_hier
 ORDER BY 1


Then a Ballantyne-like approach seems much more reasonable:

WITH -- **************** Ballantyne **********************
----------------------------------------------------------
-- top-down the hierarchy. PK = employeeID
----------------------------------------------------------
cteTree ( employeeID, firstName, lastName, ReportsTo
        , by_self, HierPath, "Level", displaysequence
        ) AS (
SELECT emp2.EmployeeID
     ,         SUBSTR(emp2.employeeName,1,1) AS FirstName
     , InitCap(SUBSTR(emp2.employeeName,3,99)) AS LastName
     , emp2.ReportsTo
     , emp2.by_self
     , ',' as HierPath
     , 1 as "Level"
     , TO_CHAR(1,'fm000000') as displaysequence
  FROM emp2
 WHERE emp2.ReportsTo IS NULL -- El Presidente
UNION ALL
SELECT emp2.EmployeeID
     ,         SUBSTR(emp2.employeeName,1,1) AS FirstName
     , InitCap(SUBSTR(emp2.employeeName,3,99)) AS LastName
     , emp2.ReportsTo
     , emp2.by_self
     , cteTree.HierPath||emp2.ReportsTo||',' AS HierPath
     , "Level" +1 as "Level"
     , cteTree.displaysequence
     ||TO_CHAR(ROW_NUMBER()OVER
          (ORDER BY InitCap(SUBSTR(emp2.employeeName,3,99))
                  , SUBSTR(emp2.employeeName,1,1))
          ,'fm000000') as displaysequence
  FROM emp2
  JOIN cteTree
    ON cteTree.EmployeeID = emp2.ReportsTo
),--------------------------------------------------------
-- Tally table for shredding CSV.  PK = node#.
----------------------------------------------------------
tally AS (
SELECT level as node#
  FROM ( SELECT MAX(COALESCE(LENGTH(hierpath)
                   -LENGTH(REPLACE(hierpath,',',''))-1
                   ,0)) max_element_cnt
           FROM cteTree )
 CONNECT BY LEVEL <= max_element_cnt
),--------------------------------------------------------
-- Sum subordinate orders. PK = ReportsTo
----------------------------------------------------------
whats_up_sub AS (
SELECT SUM(cteTree.BY_SELF) AS BY_SUB
     , SUBSTR(hierpath,INSTR(hierpath,',',1,node#)+1
             ,INSTR(hierpath,',',1,node#+1)-INSTR(hierpath,',',1,node#)-1
             ) as ReportsTo
  FROM tally
  JOIN cteTree
    ON tally.node# <= COALESCE(LENGTH(cteTree.hierpath)
                              -LENGTH(REPLACE(cteTree.hierpath,',',''))-1,0)
 GROUP
    BY SUBSTR(hierpath,INSTR(hierpath,',',1,node#)+1
             ,INSTR(hierpath,',',1,node#+1)-INSTR(hierpath,',',1,node#)-1
             )
)---------------------------------------------------------
-- Results. PK = employeeID
----------------------------------------------------------
SELECT cteTree.employeeID
     , LPAD(cteTree.firstName||'.'||cteTree.lastName
           ,((cteTree."Level"-1)*2)
             +length(cteTree.firstName
              ||'.'||cteTree.lastName)
           ,' ') as ename
     , cteTree."Level"
     , cteTree.by_self
     , COALESCE(whats_up_sub.by_sub,0) AS by_sub
     , cteTree.by_self
      +COALESCE(whats_up_sub.by_sub,0) AS total
     , count(*)over() as emp_cnt
     , max("Level")over() as maxdepth# 
  FROM cteTree
  LEFT OUTER
  JOIN whats_up_sub
    ON whats_up_sub.ReportsTo = cteTree.employeeID
 ORDER BY cteTree.employeeID
FETCH FIRST 15 ROWS ONLY
/
Elapsed: 00:00:02.28

Statistics
----------------------------------------------------
          7  recursive calls
       8434  db block gets
       6778  consistent gets
        199  physical reads
       1208  redo size
       1261  bytes sent via SQL*Net to client
        372  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
         57  sorts (memory)
          0  sorts (disk)
         15  rows processed


The "bushy recursion" (start everywhere simultaneously) seems to behave more like a cross join:

WITH --************** Bush *******************************
-- BushOut. PK = employeeID+InitialID
----------------------------------------------------------
BushOut ( InitialID, employeeID, firstName, lastName, ReportsTo, by_self
        , HierPath, "Level", depth#
        ) AS (
SELECT emp2.EmployeeID AS InitialID
     , emp2.EmployeeID
     ,         SUBSTR(emp2.employeeName,1,1) AS FirstName
     , InitCap(SUBSTR(emp2.employeeName,3,99)) AS LastName
     , emp2.ReportsTo
     , emp2.by_self
     , emp2.ReportsTo||',' as HierPath
     , 1 as "Level" -- w.r.t. InitialID
     , CASE WHEN emp2.ReportsTo IS NULL
            THEN 1
        END AS Depth# -- w.r.t. Top Boss
  FROM emp2
 UNION ALL
SELECT BushOut.InitialID
     , emp2.EmployeeID
     ,         SUBSTR(emp2.employeeName,1,1) AS FirstName
     , InitCap(SUBSTR(emp2.employeeName,3,99)) AS LastName
     , emp2.ReportsTo
     , emp2.by_self
     , BushOut.HierPath||emp2.ReportsTo||',' AS HierPath
     , "Level" +1 as "Level" -- w.r.t. InitialID
     , BushOut.Depth# + 1 AS Depth# -- w.r.t. Top Boss
  FROM emp2
  JOIN BushOut
    ON BushOut.EmployeeID = emp2.ReportsTo
), -------------------------------------------------------
-- Smoothing (depth) across nodes
----------------------------------------------------------
Smoothing AS (
SELECT INITIALID, EMPLOYEEID, FIRSTNAME, LASTNAME
     , REPORTSTO, BY_SELF
     , SUM(CASE WHEN InitialID <> EmployeeID
                THEN by_self ELSE 0 END)OVER
        (PARTITION BY InitialID) as BY_SUB
     , SUM(by_self)OVER
        (PARTITION BY InitialID) as TOTAL
     , MAX(depth#)OVER(PARTITION BY EmployeeID) AS depth#
  FROM BushOut
) -------------------------------------------------------
-- Indent names for final results
----------------------------------------------------------
SELECT EMPLOYEEID
     , LPAD(firstName||'.'||lastName
           ,((depth#-1)*2)
             +length(firstName||' '||lastName)
           ,' ') as ename
     , DEPTH# AS "Level", BY_SELF
     , BY_SUB, TOTAL
     , count(*)over() as emp_cnt, max(Depth#)OVER() Depth#
  FROM Smoothing
 WHERE initialID = employeeID
 ORDER BY employeeID
FETCH FIRST 15 ROWS ONLY
/
Elapsed: 00:00:10.50

Statistics
----------------------------------------------------
          6  recursive calls
    3321690  db block gets
       1120  consistent gets
          0  physical reads
          0  redo size
       1258  bytes sent via SQL*Net to client
        372  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
         31  sorts (memory)
          0  sorts (disk)
         15  rows processed


The 'start everywhere' approach resulted in 244K records, a lot to 'minnow down'.

Ballantyne's approach created 10,000 CSV, which needed to be shredded and summed, but it's a much smaller data set.

Did I miss anything?
Chris Saxon
October 24, 2017 - 7:56 am UTC

What's emp2.by_self? This seems to be missing from the create table.

Sorry: OrderCnt = By_self

Duke Ganote, October 24, 2017 - 12:31 pm UTC

Sorry, I must've grabbed a slightly earlier version of the emp2 CTAS. BY_SELF is just a rename of OrderCnt.
Chris Saxon
October 24, 2017 - 1:56 pm UTC

So yes, when you've scaled the data up, the bushy method is processing 200k+ rows in most steps of the plan. This leads to it using a fair bit of temp space for me:

-- run query, then:

select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                      | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                               |      |      1 |        |     15 |00:00:13.70 |    2479K|  13353 |  13352 |       |       |          |         |
|*  1 |  VIEW                                          |      |      1 |  28853 |     15 |00:00:13.70 |    2479K|  13353 |  13352 |       |       |          |         |
|   2 |   WINDOW SORT                                  |      |      1 |  28853 |  10001 |00:00:13.69 |    2479K|  13353 |  13352 |   974K|   535K|  865K (0)|         |
|*  3 |    VIEW                                        |      |      1 |  28853 |  10001 |00:00:13.67 |    2479K|  13353 |  13352 |       |       |          |         |
|   4 |     WINDOW SORT                                |      |      1 |  28853 |    209K|00:00:13.54 |    2479K|  13353 |  13352 |    14M|  1445K| 1538K (1)|   15360 |
|   5 |      WINDOW SORT                               |      |      1 |  28853 |    209K|00:00:12.36 |    2479K|  10028 |  10027 |    13M|  1413K|5716K (28)|   14336 |
|   6 |       VIEW                                     |      |      1 |  28853 |    209K|00:00:10.99 |    2479K|   2476 |   2475 |       |       |          |         |
|   7 |        UNION ALL (RECURSIVE WITH) BREADTH FIRST|      |      1 |        |    209K|00:00:10.72 |    2479K|   2476 |   2475 |  1116K|   557K|  991K (0)|         |
|   8 |         TABLE ACCESS FULL                      | EMP2 |      1 |   9472 |  10001 |00:00:00.01 |      76 |      0 |      0 |       |       |          |         |
|*  9 |         HASH JOIN                              |      |     24 |  19381 |    199K|00:00:01.11 |    1833 |   2475 |   1561 |  1671K|  1337K| 1834K (1)|    1024 |
|  10 |          TABLE ACCESS FULL                     | EMP2 |     24 |   9472 |    240K|00:00:00.02 |    1824 |      0 |      0 |       |       |          |         |
|  11 |          RECURSIVE WITH PUMP                   |      |     24 |        |    209K|00:00:00.17 |       9 |    914 |      0 |       |       |          |         |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Whereas the Ballantyne method deals with a smaller data set most of the time:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name                       | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |                            |      1 |        |     15 |00:00:01.46 |   18553 |    356 |    355 |       |       |          |
|   1 |  TEMP TABLE TRANSFORMATION                 |                            |      1 |        |     15 |00:00:01.46 |   18553 |    356 |    355 |       |       |          |
|   2 |   LOAD AS SELECT                           |                            |      1 |        |      0 |00:00:00.23 |    9595 |      0 |    355 |  2068K|  2068K|          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|                            |      1 |        |  10001 |00:00:00.19 |    9234 |      0 |      0 |   761K|   499K|  676K (0)|
|*  4 |     TABLE ACCESS FULL                      | EMP2                       |      1 |      1 |      1 |00:00:00.01 |      76 |      0 |      0 |       |       |          |
|   5 |     WINDOW SORT                            |                            |     24 |      2 |  10000 |00:00:00.11 |    1824 |      0 |      0 |   974K|   535K|  865K (0)|
|*  6 |      HASH JOIN                             |                            |     24 |      2 |  10000 |00:00:00.08 |    1824 |      0 |      0 |  1255K|  1004K| 1233K (0)|
|   7 |       RECURSIVE WITH PUMP                  |                            |     24 |        |  10001 |00:00:00.01 |       0 |      0 |      0 |       |       |          |
|   8 |       TABLE ACCESS FULL                    | EMP2                       |     24 |   9472 |    240K|00:00:00.02 |    1824 |      0 |      0 |       |       |          |
|*  9 |   VIEW                                     |                            |      1 |      3 |     15 |00:00:01.23 |    8953 |    355 |      0 |       |       |          |
|  10 |    WINDOW SORT                             |                            |      1 |      3 |  10001 |00:00:01.22 |    8953 |    355 |      0 |  1045K|   546K|  928K (0)|
|* 11 |     HASH JOIN OUTER                        |                            |      1 |      3 |  10001 |00:00:01.20 |    8953 |    355 |      0 |  1671K|  1337K| 1577K (0)|
|  12 |      VIEW                                  |                            |      1 |      3 |  10001 |00:00:00.02 |     361 |    355 |      0 |       |       |          |
|  13 |       TABLE ACCESS FULL                    | SYS_TEMP_0FD9D6EBC_3B7555B |      1 |      3 |  10001 |00:00:00.01 |     361 |    355 |      0 |       |       |          |
|  14 |      VIEW                                  |                            |      1 |      1 |   5948 |00:00:01.14 |    8592 |      0 |      0 |       |       |          |
|  15 |       HASH GROUP BY                        |                            |      1 |      1 |   5948 |00:00:01.13 |    8592 |      0 |      0 |  1354K|  1354K| 1365K (0)|
|  16 |        NESTED LOOPS                        |                            |      1 |      1 |    199K|00:00:00.84 |    8592 |      0 |      0 |       |       |          |
|  17 |         VIEW                               |                            |      1 |      1 |     23 |00:00:00.03 |     358 |      0 |      0 |       |       |          |
|  18 |          CONNECT BY WITHOUT FILTERING      |                            |      1 |        |     23 |00:00:00.03 |     358 |      0 |      0 |  2048 |  2048 | 2048  (0)|
|  19 |           VIEW                             |                            |      1 |      1 |      1 |00:00:00.03 |     358 |      0 |      0 |       |       |          |
|  20 |            SORT AGGREGATE                  |                            |      1 |      1 |      1 |00:00:00.03 |     358 |      0 |      0 |       |       |          |
|  21 |             VIEW                           |                            |      1 |      3 |  10001 |00:00:00.02 |     358 |      0 |      0 |       |       |          |
|  22 |              TABLE ACCESS FULL             | SYS_TEMP_0FD9D6EBC_3B7555B |      1 |      3 |  10001 |00:00:00.01 |     358 |      0 |      0 |       |       |          |
|* 23 |         VIEW                               |                            |     23 |      1 |    199K|00:00:00.58 |    8234 |      0 |      0 |       |       |          |
|  24 |          TABLE ACCESS FULL                 | SYS_TEMP_0FD9D6EBC_3B7555B |     23 |      3 |    230K|00:00:00.17 |    8234 |      0 |      0 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Match_recognize still seems to be coming out best though ;)

with tree (reportsto, employeeid, c, tot, pth, lev, FirstName, LastName) as (
  select reportsto, employeeid, by_self, by_self,
         employeeid || '',
         1 lev,
         SUBSTR(c.employeeName,1,1) AS FirstName
        , InitCap(SUBSTR(c.employeeName,3,99)) AS LastName
  from   emp2 c
  where  reportsto is null
  union all
  select c.reportsto, c.employeeid, c.by_self, 0 ,
         pth || '|' || c.employeeid,
         lev + 1 lev
         ,SUBSTR(c.employeeName,1,1) AS FirstName
        , InitCap(SUBSTR(c.employeeName,3,99)) AS LastName
  from   tree t
  join   emp2 c
  on     t.employeeid = c.reportsto
) search depth first by employeeid set seq
  select * from tree
  match_recognize (
    order by seq
    measures
      first(employeeid) as employeeid,
      first(reportsto) as reportsto,
      first( 
        lpad ( 
          (firstname || lastname),
          (lev - 1) * 2 +length(firstName||' '||lastName),
          ' ' 
        )
      ) as nm,
      first (lev) as lev,
      first(c) as attrs,
      sum(c) as subtot,
      first(seq) as seq
    after match skip to next row
    pattern ( strt subtree* )
    define 
      subtree as lev > first (lev)
  )
  order  by employeeid
  fetch first 15 rows only;
  
select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                   |      |      1 |        |     15 |00:00:00.38 |    9234 |       |       |          |
|*  1 |  VIEW                                              |      |      1 |     15 |     15 |00:00:00.38 |    9234 |       |       |          |
|*  2 |   WINDOW SORT PUSHED RANK                          |      |      1 |      3 |     15 |00:00:00.38 |    9234 |  2048 |  2048 | 2048  (0)|
|   3 |    VIEW                                            |      |      1 |      3 |  10001 |00:00:00.37 |    9234 |       |       |          |
|   4 |     BUFFER SORT                                    |      |      1 |      3 |  10001 |00:00:00.36 |    9234 |  1470K|   606K| 1306K (0)|
|   5 |      MATCH RECOGNIZE BUFFER DETERMINISTIC FINITE AU|      |      1 |      3 |  10001 |00:00:00.34 |    9234 |  1045K|   546K|  928K (0)|
|   6 |       VIEW                                         |      |      1 |      3 |  10001 |00:00:00.17 |    9234 |       |       |          |
|   7 |        UNION ALL (RECURSIVE WITH) DEPTH FIRST      |      |      1 |        |  10001 |00:00:00.15 |    9234 |  2250K|   697K| 1999K (0)|
|*  8 |         TABLE ACCESS FULL                          | EMP2 |      1 |      1 |      1 |00:00:00.01 |      76 |       |       |          |
|*  9 |         HASH JOIN                                  |      |     24 |      2 |  10000 |00:00:00.07 |    1824 |   996K|   996K| 1273K (0)|
|  10 |          RECURSIVE WITH PUMP                       |      |     24 |        |  10001 |00:00:00.01 |       0 |       |       |          |
|  11 |          TABLE ACCESS FULL                         | EMP2 |     24 |   9472 |    240K|00:00:00.02 |    1824 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------

Game, set, and match_recognize

Duke Ganote, October 24, 2017 - 3:06 pm UTC

The MATCH_RECOGNIZE query starts with the same top-down recursion as Dave Ballantyne, so both are dealing with the smaller initial data set (albeit MR data is normalized, rather than CSV'd). I'll have to poke some at MR... thank you!
Chris Saxon
October 24, 2017 - 3:45 pm UTC

:) Yep, the recursion is the same, but there's less processing after that step.

PL/SQL-Version

Racer I., October 26, 2017 - 12:02 pm UTC

Hi,

I'm wondering if it is possible to make an SQL-Statement that visits each node only once. Also if match_recognize does. A pl/sql-version to benchmark (by timing, probably not possible by explain plan and autotrace costs) against :

CREATE OR REPLACE PACKAGE CountLeaves AS
  CURSOR OutputCursor IS
    SELECT
      TO_NUMBER(Dummy) Pid,
      TO_NUMBER(Dummy) AttCnt,
      TO_NUMBER(Dummy) TotalCnt
    FROM DUAL; 
  TYPE OutputType IS TABLE OF OutputCursor%ROWTYPE;   
  FUNCTION DoCounts RETURN  OutputType PIPELINED;
END CountLeaves;
/

SHOW ERRORS;

CREATE OR REPLACE PACKAGE BODY CountLeaves AS
  BULK_FETCH_LIMIT CONSTANT NUMBER := 1000;
  CURSOR InputCursor IS
    WITH
    AttCnt AS (
      SELECT h.parent_id, h.child_id, COUNT(a.id) cnt
      FROM   hq_test h
        LEFT JOIN hq_test_attr a ON (h.child_id = a.id)
      GROUP BY h.parent_id, h.child_id),
    RecursiveWith (parent_id, child_id, cnt, lvl) AS (    
      SELECT parent_id, child_id, cnt, 1 lvl
      FROM   AttCnt ac
      WHERE  parent_id IS NULL
      UNION ALL
      SELECT ac.parent_id, ac.child_id, ac.cnt, r.lvl + 1
      FROM   RecursiveWith r
        JOIN AttCnt       ac ON (r.child_id = ac.parent_id))
      SEARCH DEPTH FIRST BY child_id SET pos  
    SELECT child_id pid, cnt, lvl
    FROM   RecursiveWith
    ORDER BY pos;
  TYPE InputType   IS TABLE OF InputCursor%ROWTYPE;   
  TYPE CounterType IS TABLE OF OutputCursor%ROWTYPE INDEX BY BINARY_INTEGER;
  FUNCTION DoCounts
    RETURN  OutputType PIPELINED
  IS
    vCounters CounterType;
    vRows     InputType;
    vResult   OutputCursor%ROWTYPE;
    vLevel    NUMBER := 0;
    vLvl      NUMBER;
    vCnt      NUMBER;
    vPos      NUMBER;
    vFinish   BOOLEAN := FALSE;    
  BEGIN
    DBMS_OUTPUT.ENABLE(NULL);
    vCounters(0).Pid := -1;
    OPEN InputCursor;
    LOOP
      EXIT WHEN vFinish;
      FETCH InputCursor BULK COLLECT INTO vRows LIMIT BULK_FETCH_LIMIT;
      IF (vRows.COUNT < BULK_FETCH_LIMIT) THEN
        vRows.EXTEND;
        vRows(vRows.COUNT).Lvl := 1;
        vFinish := TRUE;        
      END IF;
      vPos := 1;
      LOOP
        vLvl := vRows(vPos).Lvl;
        vCnt := vRows(vPos).Cnt;
        LOOP
          EXIT WHEN (vLevel < vLvl);
          PIPE ROW(vCounters(vLevel));
          vLevel := vLevel - 1;
          vCounters(vLevel).TotalCnt := vCounters(vLevel).TotalCnt + vCounters(vLevel + 1).TotalCnt;
        END LOOP;        
        vLevel                      := vLvl;
        vCounters(vLevel).Pid       := vRows(vPos).Pid;
        vCounters(vLevel).AttCnt    := vCnt;
        vCounters(vLevel).TotalCnt  := vCnt;
        vPos := vPos + 1;
        EXIT WHEN (vPos > vRows.COUNT);
      END LOOP; 
    END LOOP;
    CLOSE InputCursor;
    RETURN;
  END DoCounts;
END CountLeaves;
/

SELECT *
FROM   TABLE(CountLeaves.DoCounts())
ORDER BY PID


Couldn't get it to accept the select as an input cursor (with the CURSOR(SELECT...-Syntax). Probably a limitation of RW and PL/SQL.
Chris Saxon
October 30, 2017 - 5:49 pm UTC

Nice work. There's a lot of nested loops there though, so I'd be interested to see how this scales.

The plan for MR seems to be showing that it only accesses each node once. Though it depends what exactly happens in the "MATCH RECOGNIZE BUFFER DETERMINISTIC FINITE AU" step.


More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.