Collapsing one column on an effective-dated table (using CONNECT BY?)

Question and Answer

Tom Kyte

Thanks for the question, Art.

Asked: May 24, 2002 - 12:13 pm UTC

Last updated: February 01, 2005 - 1:04 pm UTC

Version: 8.1.7

Viewed 1000+ times

You Asked

Hi, Tom. I have the following columns on a person table:

PERSON_ID  PLAN_ID  PER_EFF_START_DT  PER_EFF_END_DT
---------  -------  ----------------  --------------
       54       29  07-AUG-1998       09-FEB-1999
       54       29  10-FEB-1999       16-AUG-1999
       54       29  17-AUG-1999       30-DEC-1999
       54       48  31-DEC-1999       06-SEP-2000
       54       48  07-SEP-2000       02-JAN-2001
       54       48  03-JAN-2001       18-NOV-2001
       54       29  19-NOV-2001       31-DEC-2010

Each person ID can belong to a plan for a range of time. Other changes to a person can also introduce new effective-dated records.
I can guarantee that the date ranges for a single PERSON_ID will
be contiguous and will never overlap, as shown. (I.e., every
day between a person's earliest start date and latest end date
will always be accounted for exactly once.)

I would like to "collapse" the above rows into what you might call
a plan-centric view of the data, so that I only see the date limits
of when the plan changes (I don't care about the other changes that
prompted the addition of effective-dated rows). As a result of this
SQL query, I'd have:

PERSON_ID  PLAN_ID  PLAN_EFF_START_DT  PLAN_EFF_END_DT
---------  -------  -----------------  ---------------
       54       29  07-AUG-1998        30-DEC-1999
       54       48  31-DEC-1999        18-NOV-2001
       54       29  19-NOV-2001        31-DEC-2010

Doing a simple MAX/GROUP BY causes me to lose the fact that
plan 29 existed as part of two distinct intervals, a fact I
need to retain. I've tried some CONNECT BY logic, but I'm
not sure what, if anything, to use as a START WITH condition.
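
To make this concrete, here is a minimal setup sketch for the sample
data (the table name t is an assumption, chosen to match the query in
the answer below), followed by the naive aggregate and the collapsed
result it produces:

create table t (person_id number, plan_id number,
                per_eff_start_dt date, per_eff_end_dt date);

insert into t values (54, 29, to_date('07-AUG-1998','DD-MON-YYYY'), to_date('09-FEB-1999','DD-MON-YYYY'));
insert into t values (54, 29, to_date('10-FEB-1999','DD-MON-YYYY'), to_date('16-AUG-1999','DD-MON-YYYY'));
insert into t values (54, 29, to_date('17-AUG-1999','DD-MON-YYYY'), to_date('30-DEC-1999','DD-MON-YYYY'));
insert into t values (54, 48, to_date('31-DEC-1999','DD-MON-YYYY'), to_date('06-SEP-2000','DD-MON-YYYY'));
insert into t values (54, 48, to_date('07-SEP-2000','DD-MON-YYYY'), to_date('02-JAN-2001','DD-MON-YYYY'));
insert into t values (54, 48, to_date('03-JAN-2001','DD-MON-YYYY'), to_date('18-NOV-2001','DD-MON-YYYY'));
insert into t values (54, 29, to_date('19-NOV-2001','DD-MON-YYYY'), to_date('31-DEC-2010','DD-MON-YYYY'));

-- the naive aggregate
select person_id, plan_id,
       min(per_eff_start_dt) st, max(per_eff_end_dt) en
  from t
 group by person_id, plan_id;

This returns only two rows -- (54, 29, 07-AUG-1998, 31-DEC-2010) and
(54, 48, 31-DEC-1999, 18-NOV-2001) -- merging plan 29's two separate
intervals into one range that wrongly spans the plan-48 period.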

Can you help?

Thanks, Tom.

Art

and Tom said...

We'll need to run a connect by twice for each row -- once to look "backwards" for the start and once to look "forwards" for the end -- and then DISTINCT the results:

ops$tkyte@ORA817DEV.US.ORACLE.COM> select distinct person_id, plan_id,
         (select min(per_eff_start_dt)
            from t t2
           start with t2.person_id = t.person_id
                  and t2.plan_id = t.plan_id
                  and t2.per_eff_start_dt = t.per_eff_start_dt
         connect by prior per_eff_start_dt-1 = per_eff_end_dt
                and person_id = t.person_id
                and plan_id = t.plan_id ) st,
         (select max(per_eff_end_dt)
            from t t2
           start with t2.person_id = t.person_id
                  and t2.plan_id = t.plan_id
                  and t2.per_eff_start_dt = t.per_eff_start_dt
         connect by prior per_eff_end_dt+1 = per_eff_start_dt
                and person_id = t.person_id
                and plan_id = t.plan_id ) en
    from t
   order by 3;

 PERSON_ID    PLAN_ID ST        EN
---------- ---------- --------- ---------
        54         29 07-AUG-98 30-DEC-99
        54         48 31-DEC-99 18-NOV-01
        54         29 19-NOV-01 31-DEC-10


Won't be horribly efficient on really really large sets of data!




Comments

Thank you

Art Metzer, May 26, 2002 - 9:40 am UTC


Alternate Solution Using Analytic Functions

Art Metzer, February 13, 2004 - 4:29 pm UTC

Tom,

I'm the one who originally posted this question way back when.

Over the past couple of years, I've always thought there might be a better way out there to solve this problem, perhaps one that uses analytic functions.

As I've gotten more comfortable with them, I've been tinkering off and on with analytic functions in the context of this problem. I think I've finally found an analytic function solution for it, and I wanted to report my findings back here.

Here's some test data that, like my original example, has complete, non-overlapping date coverage between an ID's earliest and latest dates:

CREATE TABLE range_test (id NUMBER, plan VARCHAR2(1), d1 DATE, d2 DATE);
INSERT INTO range_test VALUES (101,'B',TO_DATE('20000501','YYYYMMDD'),TO_DATE('20000715','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'B',TO_DATE('20000716','YYYYMMDD'),TO_DATE('20001219','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'B',TO_DATE('20001220','YYYYMMDD'),TO_DATE('20010303','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'C',TO_DATE('20010304','YYYYMMDD'),TO_DATE('20010630','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'B',TO_DATE('20010701','YYYYMMDD'),TO_DATE('20021114','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'B',TO_DATE('20021115','YYYYMMDD'),TO_DATE('20021116','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'C',TO_DATE('20021117','YYYYMMDD'),TO_DATE('20021231','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'C',TO_DATE('20030101','YYYYMMDD'),TO_DATE('20030412','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'C',TO_DATE('20030413','YYYYMMDD'),TO_DATE('20031001','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'B',TO_DATE('20031002','YYYYMMDD'),TO_DATE('47121231','YYYYMMDD'));
INSERT INTO range_test VALUES (102,'A',TO_DATE('19990719','YYYYMMDD'),TO_DATE('47121231','YYYYMMDD'));
INSERT INTO range_test VALUES (103,'C',TO_DATE('20020228','YYYYMMDD'),TO_DATE('20020922','YYYYMMDD'));
INSERT INTO range_test VALUES (103,'C',TO_DATE('20020923','YYYYMMDD'),TO_DATE('20021028','YYYYMMDD'));
INSERT INTO range_test VALUES (103,'D',TO_DATE('20021029','YYYYMMDD'),TO_DATE('20030506','YYYYMMDD'));
INSERT INTO range_test VALUES (103,'D',TO_DATE('20030507','YYYYMMDD'),TO_DATE('47121231','YYYYMMDD'));
INSERT INTO range_test VALUES (104,'B',TO_DATE('20000819','YYYYMMDD'),TO_DATE('20001003','YYYYMMDD'));
INSERT INTO range_test VALUES (104,'B',TO_DATE('20001004','YYYYMMDD'),TO_DATE('20010623','YYYYMMDD'));
INSERT INTO range_test VALUES (104,'B',TO_DATE('20010624','YYYYMMDD'),TO_DATE('20010917','YYYYMMDD'));
INSERT INTO range_test VALUES (104,'B',TO_DATE('20010918','YYYYMMDD'),TO_DATE('20031116','YYYYMMDD'));
INSERT INTO range_test VALUES (104,'B',TO_DATE('20031117','YYYYMMDD'),TO_DATE('47121231','YYYYMMDD'));
INSERT INTO range_test VALUES (105,'A',TO_DATE('19990416','YYYYMMDD'),TO_DATE('20010404','YYYYMMDD'));
INSERT INTO range_test VALUES (105,'A',TO_DATE('20010405','YYYYMMDD'),TO_DATE('20020331','YYYYMMDD'));
INSERT INTO range_test VALUES (105,'A',TO_DATE('20020401','YYYYMMDD'),TO_DATE('20020402','YYYYMMDD'));
INSERT INTO range_test VALUES (105,'A',TO_DATE('20020403','YYYYMMDD'),TO_DATE('20021230','YYYYMMDD'));
INSERT INTO range_test VALUES (105,'C',TO_DATE('20021231','YYYYMMDD'),TO_DATE('20030710','YYYYMMDD'));
INSERT INTO range_test VALUES (105,'C',TO_DATE('20030711','YYYYMMDD'),TO_DATE('20031008','YYYYMMDD'));
INSERT INTO range_test VALUES (105,'C',TO_DATE('20031009','YYYYMMDD'),TO_DATE('47121231','YYYYMMDD'));
INSERT INTO range_test VALUES (106,'C',TO_DATE('20000624','YYYYMMDD'),TO_DATE('20000707','YYYYMMDD'));
INSERT INTO range_test VALUES (106,'C',TO_DATE('20000708','YYYYMMDD'),TO_DATE('20000928','YYYYMMDD'));
INSERT INTO range_test VALUES (106,'C',TO_DATE('20000929','YYYYMMDD'),TO_DATE('20010111','YYYYMMDD'));
INSERT INTO range_test VALUES (106,'C',TO_DATE('20010112','YYYYMMDD'),TO_DATE('20011011','YYYYMMDD'));
INSERT INTO range_test VALUES (106,'C',TO_DATE('20011012','YYYYMMDD'),TO_DATE('20030504','YYYYMMDD'));
INSERT INTO range_test VALUES (106,'B',TO_DATE('20030505','YYYYMMDD'),TO_DATE('47121231','YYYYMMDD'));
INSERT INTO range_test VALUES (107,'A',TO_DATE('20021102','YYYYMMDD'),TO_DATE('20021130','YYYYMMDD'));
INSERT INTO range_test VALUES (107,'B',TO_DATE('20021201','YYYYMMDD'),TO_DATE('20021231','YYYYMMDD'));
INSERT INTO range_test VALUES (107,'C',TO_DATE('20030101','YYYYMMDD'),TO_DATE('20030131','YYYYMMDD'));
INSERT INTO range_test VALUES (107,'D',TO_DATE('20030201','YYYYMMDD'),TO_DATE('20030617','YYYYMMDD'));
INSERT INTO range_test VALUES (107,'A',TO_DATE('20030618','YYYYMMDD'),TO_DATE('47121231','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'A',TO_DATE('20020904','YYYYMMDD'),TO_DATE('20020920','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'A',TO_DATE('20020921','YYYYMMDD'),TO_DATE('20021001','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'A',TO_DATE('20021002','YYYYMMDD'),TO_DATE('20021003','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'B',TO_DATE('20021004','YYYYMMDD'),TO_DATE('20021015','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'B',TO_DATE('20021016','YYYYMMDD'),TO_DATE('20021120','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'B',TO_DATE('20021121','YYYYMMDD'),TO_DATE('20030213','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'C',TO_DATE('20030214','YYYYMMDD'),TO_DATE('20030329','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'C',TO_DATE('20030330','YYYYMMDD'),TO_DATE('20030331','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'C',TO_DATE('20030401','YYYYMMDD'),TO_DATE('20030825','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'A',TO_DATE('20030826','YYYYMMDD'),TO_DATE('20030828','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'A',TO_DATE('20030829','YYYYMMDD'),TO_DATE('20040201','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'A',TO_DATE('20040202','YYYYMMDD'),TO_DATE('47121231','YYYYMMDD'));

Here is the solution you had provided:
SELECT DISTINCT id
     , plan
     , (SELECT MIN(rt2.d1)
          FROM range_test rt2
         START WITH rt2.id = rt.id
                AND rt2.plan = rt.plan
                AND rt2.d1 = rt.d1
       CONNECT BY PRIOR rt2.d1-1 = rt2.d2
               AND rt2.id = rt.id
               AND rt2.plan = rt.plan) st
     , (SELECT MAX(rt2.d2)
          FROM range_test rt2
         START WITH rt2.id = rt.id
                AND rt2.plan = rt.plan
                AND rt2.d1 = rt.d1
       CONNECT BY PRIOR rt2.d2+1 = rt2.d1
               AND rt2.id = rt.id
               AND rt2.plan = rt.plan) en
  FROM range_test rt
 ORDER BY 1,3
/

Here is the analytic function solution I came up with:

SELECT chgs.id
     , chgs.plan
     , MIN(chgs.d1) st
     , MAX(chgs.d2) en
  FROM (SELECT flips.id
             , flips.plan
             , flips.d1
             , flips.d2
               -- Within each id, this running sum nth_chg:
               --   starts at 1...
               --   ...stays the same for contiguous date ranges having the same plan value...
               --   ...and increments when a new plan value is encountered.
             , SUM(flips.ctr)
                 OVER (PARTITION BY flips.id
                           ORDER BY flips.d1) + 1 nth_chg
          FROM (SELECT rt.id
                     , rt.plan
                     , rt.d1
                     , rt.d2
                     , LEAD (rt.plan)
                         OVER (PARTITION BY rt.id
                                   ORDER BY rt.d1) next_plan -- (computed but not used below)
                       -- ctr is 1 if the current row represents a change in plan value for an id
                     , DECODE(LAG(rt.plan) OVER (PARTITION BY rt.id ORDER BY rt.d1)
                             , NULL,    0
                             , rt.plan, 0
                             ,          1) ctr
                  FROM range_test rt
                 ORDER BY rt.id
                        , rt.d1) flips
       ) chgs
 -- This GROUP BY collapses on nth_chg to get start and end
 -- dates for each plan-consistent range of time for an id.
 GROUP BY chgs.id
        , chgs.plan
        , chgs.nth_chg
 ORDER BY chgs.id
        , MIN(chgs.d1)
/

Given the above test data, both queries produce the following result set:

        ID P ST         EN
---------- - ---------- ----------
       101 B 5/1/2000   3/3/2001
       101 C 3/4/2001   6/30/2001
       101 B 7/1/2001   11/16/2002
       101 C 11/17/2002 10/1/2003
       101 B 10/2/2003  12/31/4712
       102 A 7/19/1999  12/31/4712
       103 C 2/28/2002  10/28/2002
       103 D 10/29/2002 12/31/4712
       104 B 8/19/2000  12/31/4712
       105 A 4/16/1999  12/30/2002
       105 C 12/31/2002 12/31/4712
       106 C 6/24/2000  5/4/2003
       106 B 5/5/2003   12/31/4712
       107 A 11/2/2002  11/30/2002
       107 B 12/1/2002  12/31/2002
       107 C 1/1/2003   1/31/2003
       107 D 2/1/2003   6/17/2003
       107 A 6/18/2003  12/31/4712
       108 A 9/4/2002   10/3/2002
       108 B 10/4/2002  2/13/2003
       108 C 2/14/2003  8/25/2003
       108 A 8/26/2003  12/31/4712

22 rows selected.

The tkprof output from the CONNECT BY solution:

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ---------- ----------
Parse        1      0.01       0.01          0          0          0          0
Execute      1      0.00       0.00          0          0          0          0
Fetch        2      0.05       0.04          0        499       1204         22
------- ------  -------- ---------- ---------- ---------- ---------- ----------
total        4      0.06       0.05          0        499       1204         22

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 217

Rows     Row Source Operation
-------  ---------------------------------------------------
     22  SORT UNIQUE
     50   TABLE ACCESS FULL RANGE_TEST


The tkprof output from the analytic function solution:

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ---------- ----------
Parse        1      0.00       0.04          0          0          0          0
Execute      1      0.00       0.01          0          0          0          0
Fetch        2      0.01       0.01          0          1          4         22
------- ------  -------- ---------- ---------- ---------- ---------- ----------
total        4      0.01       0.06          0          1          4         22

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 217

Rows     Row Source Operation
-------  ---------------------------------------------------
     22  SORT ORDER BY
     22   SORT GROUP BY
     50    VIEW
     50     WINDOW BUFFER
     50      VIEW
     50       WINDOW SORT
     50        TABLE ACCESS FULL RANGE_TEST

Analytic functions rock and roll!

Creating a pivot table from date effective data

Maria, April 26, 2004 - 12:06 pm UTC

My requirement is to produce a PivotTable that will display the number of people in a given plan, in a given month.

As a preliminary step, I have created the following view, which "fakes" a join to a dummy SELECT that I use to generate 24 months' worth of data....
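
(A note on the row generator used inside the view below, sketched in
isolation for clarity: grouping a single row from DUAL by a CUBE of
five expressions yields 2^5 = 32 rows, and a ROWNUM filter keeps the
first 24:

SELECT ROWNUM sequence
  FROM (SELECT NULL FROM dual GROUP BY CUBE(1,1,1,1,1))
 WHERE ROWNUM <= 24;

The same generator appears inline in the view's FROM clause.)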

CREATE OR REPLACE VIEW RANGE_TEST_V ( ID
                                    , PLAN
                                    , ST
                                    , EN
                                    , SEQUENCE
                                    , PERIOD_START_DATE
                                    , PERIOD_END_DATE
                                    , EFFECTIVE_DATE )
AS SELECT ranges.id
        , ranges.plan
        , ranges.st
        , ranges.en
        , sequences.sequence
          --
          -- if the start date of the row is before the start of the
          -- reporting period, decode to the start date of the reporting period
        , DECODE(SIGN((to_date('01-'||to_char(add_months(trunc(sysdate),-24),'MON-RRRR'),'DD-MON-RRRR')-1)-ranges.st)
                ,1, ranges.st
                ,(to_date('01-'||to_char(add_months(trunc(sysdate),-24),'MON-RRRR'),'DD-MON-RRRR')-1)) period_start_date
          --
          -- if the end date of the row is after the end of the reporting
          -- period, decode to the end of the reporting period
          --
        , DECODE(SIGN(ranges.en-(to_date('01-'||to_char(trunc(sysdate),'MON-RRRR'),'DD-MON-RRRR')))
                ,-1, ranges.en
                ,(to_date('01-'||to_char(trunc(sysdate),'MON-RRRR'),'DD-MON-RRRR'))) period_end_date
          --
          -- determine the last date of the month for this row, based on the
          -- sequence number; this allows us to pick rows where the effective
          -- date is between start (st) and end (en) dates
          --
        , add_months(to_date('01-'||TO_CHAR(
              DECODE(SIGN((to_date('01-'||to_char(add_months(trunc(sysdate),-24),'MON-RRRR'),'DD-MON-RRRR')-1)-ranges.st)
                    ,1, ranges.st
                    ,(to_date('01-'||to_char(add_months(trunc(sysdate),-24),'MON-RRRR'),'DD-MON-RRRR')-1)),'MON-RRRR'),'DD-MON-RRRR')-1,sequence) effective_date
     FROM (SELECT chgs.id
                , chgs.plan
                , MIN(chgs.d1) st
                , MAX(chgs.d2) en
                , 'x' dummy
             FROM (SELECT flips.id
                        , flips.plan
                        , flips.d1
                        , flips.d2
                          -- Within each id, this running sum nth_chg:
                          --   starts at 1...
                          --   ...stays the same for contiguous date ranges having the same plan value...
                          --   ...and increments when a new plan value is encountered.
                        , SUM(flips.ctr)
                            OVER (PARTITION BY flips.id
                                      ORDER BY flips.d1) + 1 nth_chg
                     FROM (SELECT rt.id
                                , rt.plan
                                , rt.d1
                                , rt.d2
                                , LEAD (rt.plan)
                                    OVER (PARTITION BY rt.id
                                              ORDER BY rt.d1) next_plan
                                  -- ctr is 1 if the current row represents a change in plan value for an id
                                , DECODE(LAG(rt.plan) OVER (PARTITION BY rt.id ORDER BY rt.d1)
                                        , NULL,    0
                                        , rt.plan, 0
                                        ,          1) ctr
                             FROM range_test rt
                            WHERE d2 >= (to_date('01-'||to_char(add_months(trunc(sysdate),-24),'MON-RRRR'),'DD-MON-RRRR')-1)
                              AND d1 <= (to_date('01-'||to_char(trunc(sysdate),'MON-RRRR'),'DD-MON-RRRR'))
                            ORDER BY rt.id, rt.d1) flips
                  ) chgs
            GROUP BY chgs.id
                   , chgs.plan
                   , chgs.nth_chg
            ORDER BY chgs.id, MIN(chgs.d1)) ranges
        , (SELECT ROWNUM sequence, 'x' dummy
             FROM (SELECT NULL
                     FROM dual
                    GROUP BY CUBE(1,1,1,1,1))
          ) sequences
    WHERE ranges.dummy = sequences.dummy
      AND sequences.sequence <= 24 -- for 24 months of data
      AND add_months(to_date('01-'||TO_CHAR(
              DECODE(SIGN((to_date('01-'||to_char(add_months(trunc(sysdate),-24),'MON-RRRR'),'DD-MON-RRRR')-1)-ranges.st)
                    ,1, ranges.st
                    ,(to_date('01-'||to_char(add_months(trunc(sysdate),-24),'MON-RRRR'),'DD-MON-RRRR')-1)),'MON-RRRR'),'DD-MON-RRRR')-1,sequence)
          BETWEEN ranges.st AND ranges.en
/

View created.

Then do a select...

select count(id), plan, effective_date
  from range_test_v
 where effective_date between to_date('01-JAN-2001','DD-MON-RRRR')
                          and to_date('31-DEC-2001','DD-MON-RRRR')
 group by plan, effective_date
/


 COUNT(ID) P EFFECTIVE
---------- - ---------
         1 A 31-JAN-01
         1 A 28-FEB-01
         1 A 31-MAR-01
         2 A 30-APR-01
         2 A 31-MAY-01
         2 A 30-JUN-01
         1 A 31-JUL-01
         1 A 31-AUG-01
         1 A 30-SEP-01
         1 A 31-OCT-01
         1 A 30-NOV-01
         1 A 31-DEC-01
         1 B 31-JUL-01
         1 B 31-AUG-01
         2 B 30-SEP-01
         2 B 31-OCT-01
         2 B 30-NOV-01
         2 B 31-DEC-01
         1 C 31-OCT-01
         1 C 30-NOV-01
         1 C 31-DEC-01

This gives me data I can use to create my PivotTable - not a problem.

I use the plan_id at the end of each month to determine which rows to include - so if a person moves between plans in a month, I only use the plan_id as at the last day of the month in the view.

If a person leaves a plan during the month, and doesn't enter a new plan, I do not count them towards that month.

I have a feeling that I am missing something here and that there is a far more elegant solution to be found to this problem.

One issue that I have spotted is that if a person entered a plan more than 24 months ago, I will miss them from the count because I run out of months - my ADD_MONTHS on the start date of the plan will not generate enough months to reach the present date.

I have not found a solution to this yet (except to use more than 24 months in the query, or to work backwards from the present date - which I am working on ....) :(

I am also thinking about discarding the analytics and using the raw data instead, as I do not necessarily need to know the earliest start and latest end date for the plan/person.
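
For what it's worth, a minimal sketch of that raw-data idea, which also
sidesteps the 24-month limit (the trailing-24-month window is an
assumption for illustration): generate the month-end dates directly and
join each one to the raw rows whose range covers it, so the plan counted
is whatever was in effect on the last day of the month:

SELECT months.month_end, rt.plan, COUNT(rt.id) cnt
  FROM range_test rt
     , (SELECT ADD_MONTHS(LAST_DAY(ADD_MONTHS(TRUNC(SYSDATE), -25)), ROWNUM) month_end
          FROM (SELECT NULL FROM dual GROUP BY CUBE(1,1,1,1,1))
         WHERE ROWNUM <= 24) months  -- last day of each of the trailing 24 months
 WHERE months.month_end BETWEEN rt.d1 AND rt.d2
 GROUP BY months.month_end, rt.plan
 ORDER BY months.month_end, rt.plan;

Because each month-end date is matched against the raw ranges directly,
a person who entered a plan more than 24 months ago is still counted,
and no collapsing step is needed for this report.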

Any assistance or hints would be greatly appreciated.

Regards,

Maria.

p.s. The word wrap has killed the formatting :(

Another, simpler analytic function solution for the original question

Frank Zhou, February 01, 2005 - 1:04 pm UTC

Here is another solution to the original question, using analytic functions.

Frank

SELECT id, plan, MIN(d1) d1_min, MAX(d2) d2_max
  FROM (SELECT id, plan, d1, d2, grp,
               -- carry the most recent change-marker forward across the rows
               -- belonging to the same contiguous stretch of a single plan
               MAX(grp) OVER (PARTITION BY id ORDER BY d1) grp_max
          FROM (SELECT rt.id, rt.plan, rt.d1, rt.d2,
                       -- mark a row only when its plan differs from the previous
                       -- row's plan (the NVL default can never equal plan, so the
                       -- first row of each id is always marked)
                       CASE WHEN NVL(LAG(rt.plan) OVER (PARTITION BY rt.id ORDER BY rt.d1),
                                     plan||'unknown') <> plan
                            THEN ROW_NUMBER() OVER (PARTITION BY rt.id ORDER BY rt.d1)
                       END grp
                  FROM range_test rt)
       )
 GROUP BY id, grp_max, plan
 ORDER BY id, d1_min