Alternate Solution Using Analytic Functions
Art Metzer, February 13, 2004 - 4:29 pm UTC
Tom,
I'm the one who originally posted this question way back when.
Over the past couple of years, I've always thought there might be a better way out there to solve this problem, perhaps one that uses analytic functions.
As I've gotten more comfortable with analytic functions, I've been tinkering with this problem off and on, and I think I've finally found an analytic-function solution for it. I wanted to report my findings back here.
Here's some test data that, like my original example, has complete, non-overlapping date coverage between an ID's earliest and latest dates:
CREATE TABLE range_test (id NUMBER, plan VARCHAR2(1), d1 DATE, d2 DATE);
INSERT INTO range_test VALUES (101,'B',TO_DATE('20000501','YYYYMMDD'),TO_DATE('20000715','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'B',TO_DATE('20000716','YYYYMMDD'),TO_DATE('20001219','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'B',TO_DATE('20001220','YYYYMMDD'),TO_DATE('20010303','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'C',TO_DATE('20010304','YYYYMMDD'),TO_DATE('20010630','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'B',TO_DATE('20010701','YYYYMMDD'),TO_DATE('20021114','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'B',TO_DATE('20021115','YYYYMMDD'),TO_DATE('20021116','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'C',TO_DATE('20021117','YYYYMMDD'),TO_DATE('20021231','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'C',TO_DATE('20030101','YYYYMMDD'),TO_DATE('20030412','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'C',TO_DATE('20030413','YYYYMMDD'),TO_DATE('20031001','YYYYMMDD'));
INSERT INTO range_test VALUES (101,'B',TO_DATE('20031002','YYYYMMDD'),TO_DATE('47121231','YYYYMMDD'));
INSERT INTO range_test VALUES (102,'A',TO_DATE('19990719','YYYYMMDD'),TO_DATE('47121231','YYYYMMDD'));
INSERT INTO range_test VALUES (103,'C',TO_DATE('20020228','YYYYMMDD'),TO_DATE('20020922','YYYYMMDD'));
INSERT INTO range_test VALUES (103,'C',TO_DATE('20020923','YYYYMMDD'),TO_DATE('20021028','YYYYMMDD'));
INSERT INTO range_test VALUES (103,'D',TO_DATE('20021029','YYYYMMDD'),TO_DATE('20030506','YYYYMMDD'));
INSERT INTO range_test VALUES (103,'D',TO_DATE('20030507','YYYYMMDD'),TO_DATE('47121231','YYYYMMDD'));
INSERT INTO range_test VALUES (104,'B',TO_DATE('20000819','YYYYMMDD'),TO_DATE('20001003','YYYYMMDD'));
INSERT INTO range_test VALUES (104,'B',TO_DATE('20001004','YYYYMMDD'),TO_DATE('20010623','YYYYMMDD'));
INSERT INTO range_test VALUES (104,'B',TO_DATE('20010624','YYYYMMDD'),TO_DATE('20010917','YYYYMMDD'));
INSERT INTO range_test VALUES (104,'B',TO_DATE('20010918','YYYYMMDD'),TO_DATE('20031116','YYYYMMDD'));
INSERT INTO range_test VALUES (104,'B',TO_DATE('20031117','YYYYMMDD'),TO_DATE('47121231','YYYYMMDD'));
INSERT INTO range_test VALUES (105,'A',TO_DATE('19990416','YYYYMMDD'),TO_DATE('20010404','YYYYMMDD'));
INSERT INTO range_test VALUES (105,'A',TO_DATE('20010405','YYYYMMDD'),TO_DATE('20020331','YYYYMMDD'));
INSERT INTO range_test VALUES (105,'A',TO_DATE('20020401','YYYYMMDD'),TO_DATE('20020402','YYYYMMDD'));
INSERT INTO range_test VALUES (105,'A',TO_DATE('20020403','YYYYMMDD'),TO_DATE('20021230','YYYYMMDD'));
INSERT INTO range_test VALUES (105,'C',TO_DATE('20021231','YYYYMMDD'),TO_DATE('20030710','YYYYMMDD'));
INSERT INTO range_test VALUES (105,'C',TO_DATE('20030711','YYYYMMDD'),TO_DATE('20031008','YYYYMMDD'));
INSERT INTO range_test VALUES (105,'C',TO_DATE('20031009','YYYYMMDD'),TO_DATE('47121231','YYYYMMDD'));
INSERT INTO range_test VALUES (106,'C',TO_DATE('20000624','YYYYMMDD'),TO_DATE('20000707','YYYYMMDD'));
INSERT INTO range_test VALUES (106,'C',TO_DATE('20000708','YYYYMMDD'),TO_DATE('20000928','YYYYMMDD'));
INSERT INTO range_test VALUES (106,'C',TO_DATE('20000929','YYYYMMDD'),TO_DATE('20010111','YYYYMMDD'));
INSERT INTO range_test VALUES (106,'C',TO_DATE('20010112','YYYYMMDD'),TO_DATE('20011011','YYYYMMDD'));
INSERT INTO range_test VALUES (106,'C',TO_DATE('20011012','YYYYMMDD'),TO_DATE('20030504','YYYYMMDD'));
INSERT INTO range_test VALUES (106,'B',TO_DATE('20030505','YYYYMMDD'),TO_DATE('47121231','YYYYMMDD'));
INSERT INTO range_test VALUES (107,'A',TO_DATE('20021102','YYYYMMDD'),TO_DATE('20021130','YYYYMMDD'));
INSERT INTO range_test VALUES (107,'B',TO_DATE('20021201','YYYYMMDD'),TO_DATE('20021231','YYYYMMDD'));
INSERT INTO range_test VALUES (107,'C',TO_DATE('20030101','YYYYMMDD'),TO_DATE('20030131','YYYYMMDD'));
INSERT INTO range_test VALUES (107,'D',TO_DATE('20030201','YYYYMMDD'),TO_DATE('20030617','YYYYMMDD'));
INSERT INTO range_test VALUES (107,'A',TO_DATE('20030618','YYYYMMDD'),TO_DATE('47121231','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'A',TO_DATE('20020904','YYYYMMDD'),TO_DATE('20020920','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'A',TO_DATE('20020921','YYYYMMDD'),TO_DATE('20021001','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'A',TO_DATE('20021002','YYYYMMDD'),TO_DATE('20021003','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'B',TO_DATE('20021004','YYYYMMDD'),TO_DATE('20021015','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'B',TO_DATE('20021016','YYYYMMDD'),TO_DATE('20021120','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'B',TO_DATE('20021121','YYYYMMDD'),TO_DATE('20030213','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'C',TO_DATE('20030214','YYYYMMDD'),TO_DATE('20030329','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'C',TO_DATE('20030330','YYYYMMDD'),TO_DATE('20030331','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'C',TO_DATE('20030401','YYYYMMDD'),TO_DATE('20030825','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'A',TO_DATE('20030826','YYYYMMDD'),TO_DATE('20030828','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'A',TO_DATE('20030829','YYYYMMDD'),TO_DATE('20040201','YYYYMMDD'));
INSERT INTO range_test VALUES (108,'A',TO_DATE('20040202','YYYYMMDD'),TO_DATE('47121231','YYYYMMDD'));
Here is the solution you had provided:
SELECT
DISTINCT id
, plan
, (SELECT MIN(rt2.d1)
FROM range_test rt2
START WITH rt2.id = rt.id
AND rt2.plan = rt.plan
AND rt2.d1 = rt.d1
CONNECT BY PRIOR rt2.d1-1 = rt2.d2
AND rt2.id = rt.id
AND rt2.plan = rt.plan) st
, (SELECT MAX(rt2.d2)
FROM range_test rt2
START WITH rt2.id = rt.id
AND rt2.plan = rt.plan
AND rt2.d1 = rt.d1
CONNECT BY PRIOR rt2.d2+1 = rt2.d1
AND rt2.id = rt.id
AND rt2.plan = rt.plan) en
FROM range_test rt
ORDER BY 1,3
/
Here is the analytic function solution I came up with:
SELECT chgs.id
, chgs.plan
, MIN(chgs.d1) st
, MAX(chgs.d2) en
FROM (SELECT flips.id
, flips.plan
, flips.d1
, flips.d2
-- Within each id, this sum nth_chg:
-- starts at 1...
-- ...stays the same for contiguous date ranges having the same plan value...
-- ...and increments when a new plan value is encountered.
, SUM(flips.ctr)
OVER (PARTITION BY flips.id
ORDER BY flips.d1) + 1 nth_chg
FROM (SELECT rt.id
, rt.plan
, rt.d1
, rt.d2
, LEAD (rt.plan)
OVER (PARTITION BY rt.id
ORDER BY rt.d1) next_plan
-- ctr is 1 if current row represents a change in plan value for an id
, DECODE(LAG(rt.plan) OVER (PARTITION BY rt.id ORDER BY rt.d1)
, NULL, 0
, rt.plan, 0
, 1) ctr
FROM range_test rt
ORDER BY rt.id
, rt.d1) flips
) chgs
-- This GROUP BY collapses on nth_chg to get start and end
-- dates for each plan-consistent range of time for an id.
GROUP BY chgs.id
, chgs.plan
, chgs.nth_chg
ORDER BY chgs.id
, MIN(chgs.d1)
/
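In case it helps anyone following along, here is a throwaway query (restricted to id 101, and not part of the solution itself) that exposes the ctr flag and the running sum nth_chg, just to show how the groups get built:
SELECT flips.id
     , flips.plan
     , flips.d1
     , flips.d2
     , flips.ctr
       -- running count of plan changes so far; rows sharing a value of
       -- nth_chg belong to the same contiguous, plan-consistent range
     , SUM(flips.ctr) OVER (PARTITION BY flips.id
                            ORDER BY flips.d1) + 1 nth_chg
FROM (SELECT rt.id
           , rt.plan
           , rt.d1
           , rt.d2
             -- 1 when this row's plan differs from the previous row's plan
           , DECODE(LAG(rt.plan) OVER (PARTITION BY rt.id ORDER BY rt.d1)
                   , NULL, 0
                   , rt.plan, 0
                   , 1) ctr
      FROM range_test rt
      WHERE rt.id = 101) flips
ORDER BY flips.d1
/
For id 101, ctr comes out as 0,0,0,1,1,0,1,0,0,1 and nth_chg as 1,1,1,2,3,3,4,4,4,5, which is why the outer GROUP BY collapses that id's ten rows into five ranges.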
Given the above test data, both queries produce the following result set:
        ID P ST         EN
---------- - ---------- ----------
       101 B 5/1/2000   3/3/2001
       101 C 3/4/2001   6/30/2001
       101 B 7/1/2001   11/16/2002
       101 C 11/17/2002 10/1/2003
       101 B 10/2/2003  12/31/4712
       102 A 7/19/1999  12/31/4712
       103 C 2/28/2002  10/28/2002
       103 D 10/29/2002 12/31/4712
       104 B 8/19/2000  12/31/4712
       105 A 4/16/1999  12/30/2002
       105 C 12/31/2002 12/31/4712
       106 C 6/24/2000  5/4/2003
       106 B 5/5/2003   12/31/4712
       107 A 11/2/2002  11/30/2002
       107 B 12/1/2002  12/31/2002
       107 C 1/1/2003   1/31/2003
       107 D 2/1/2003   6/17/2003
       107 A 6/18/2003  12/31/4712
       108 A 9/4/2002   10/3/2002
       108 B 10/4/2002  2/13/2003
       108 C 2/14/2003  8/25/2003
       108 A 8/26/2003  12/31/4712
22 rows selected.
The tkprof output from the CONNECT BY solution:
call     count      cpu    elapsed       disk      query    current       rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse        1     0.01       0.01          0          0          0          0
Execute      1     0.00       0.00          0          0          0          0
Fetch        2     0.05       0.04          0        499       1204         22
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total        4     0.06       0.05          0        499       1204         22
Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 217
Rows    Row Source Operation
------- ---------------------------------------------------
     22 SORT UNIQUE
     50  TABLE ACCESS FULL RANGE_TEST
The tkprof output from the analytic function solution:
call     count      cpu    elapsed       disk      query    current       rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse        1     0.00       0.04          0          0          0          0
Execute      1     0.00       0.01          0          0          0          0
Fetch        2     0.01       0.01          0          1          4         22
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total        4     0.01       0.06          0          1          4         22
Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 217
Rows    Row Source Operation
------- ---------------------------------------------------
     22 SORT ORDER BY
     22  SORT GROUP BY
     50   VIEW
     50    WINDOW BUFFER
     50     VIEW
     50      WINDOW SORT
     50       TABLE ACCESS FULL RANGE_TEST
Analytic functions rock and roll!
Creating a pivot table from date effective data
Maria, April 26, 2004 - 12:06 pm UTC
My requirement is to produce a PivotTable that displays the number of people in a given plan in a given month.
As a preliminary step, I have created the following view, which "fakes" a join to a dummy SELECT that I use to generate 24 months' worth of data...
CREATE OR REPLACE VIEW RANGE_TEST_V ( ID
, PLAN
, ST
, EN
, SEQUENCE
, PERIOD_START_DATE
, PERIOD_END_DATE
, EFFECTIVE_DATE )
AS SELECT ranges.id
, ranges.plan
, ranges.st
, ranges.en
, sequences.sequence
--
-- if the start date of the row is before the start of
-- the reporting period, decode to start date of reporting period
, DECODE(SIGN((to_date('01-'||to_char(add_months(trunc(sysdate),-24),'MON-RRRR'),'DD-MON-RRRR')-1)-ranges.st)
,1, ranges.st
,(to_date('01-'||to_char(add_months(trunc(sysdate),-24),'MON-RRRR'),'DD-MON-RRRR')-1)) period_start_date
--
-- if the end date of the row is after the end of the reporting period,
-- decode to end of reporting period
--
, DECODE(SIGN(ranges.en-(to_date('01-'||to_char(trunc(sysdate),'MON-RRRR'),'DD-MON-RRRR')))
,-1, ranges.en
,(to_date('01-'||to_char(trunc(sysdate),'MON-RRRR'),'DD-MON-RRRR'))) period_end_date
--
-- determine the last date of the month for this row, based on the sequence number
-- this allows us to pick rows where the effective date is between start (st) and end (en) dates
--
, add_months(to_date('01-'||TO_CHAR(
DECODE(SIGN((to_date('01-'||to_char(add_months(trunc(sysdate),-24),'MON-RRRR'),'DD-MON-RRRR')-1)-ranges.st)
,1, ranges.st
,(to_date(
'01-'||to_char(add_months(trunc(sysdate),-24),'MON-RRRR'),
'DD-MON-RRRR')-1)),'MON-RRRR'),'DD-MON-RRRR')-1,sequence) effective_date
FROM
(SELECT chgs.id
, chgs.plan
, MIN(chgs.d1) st
, MAX(chgs.d2) en
, 'x' dummy
FROM (SELECT flips.id
, flips.plan
, flips.d1
, flips.d2
-- Within each id, this sum nth_chg:
-- starts at 1...
-- ...stays the same for contiguous date ranges having the same plan value...
-- ...and increments when a new plan value is encountered.
, SUM(flips.ctr)
OVER (PARTITION BY flips.id
ORDER BY flips.d1) + 1 nth_chg
FROM (SELECT rt.id
, rt.plan
, rt.d1
, rt.d2
, LEAD (rt.plan)
OVER (PARTITION BY rt.id
ORDER BY rt.d1) next_plan
-- ctr is 1 if current row represents a change in plan value for an id
, DECODE(LAG(rt.plan) OVER (PARTITION BY rt.id ORDER BY rt.d1)
, NULL, 0
, rt.plan, 0
, 1) ctr
FROM range_test rt
WHERE d2 >= (to_date('01-'||to_char(add_months(trunc(sysdate),-24),'MON-RRRR'),'DD-MON-RRRR')-1)
AND d1 <= (to_date('01-'||to_char(trunc(sysdate),'MON-RRRR'),'DD-MON-RRRR'))
ORDER BY rt.id, rt.d1) flips
) chgs
GROUP BY chgs.id
, chgs.plan
, chgs.nth_chg
ORDER BY chgs.id, MIN(chgs.d1)) ranges
,
(SELECT ROWNUM sequence, 'x' dummy
FROM
(SELECT NULL FROM DUAL
GROUP BY CUBE(1,1,1,1,1))
) sequences
WHERE ranges.dummy = sequences.dummy
AND sequences.sequence <= 24 -- for 24 months of data
AND add_months(to_date('01-'||TO_CHAR(
DECODE(SIGN((to_date('01-'||to_char(add_months(trunc(sysdate),-24),'MON-RRRR'),'DD-MON-RRRR')-1)-ranges.st)
,1, ranges.st
,(to_date('01-'||to_char(add_months(trunc(sysdate),-24),'MON-RRRR'),'DD-MON-RRRR')-1)),'MON-RRRR'),'DD-MON-RRRR')-1,sequence)
between ranges.st and ranges.en
/
View created.
Then do a select...
select count(id), plan, effective_date
from range_test_v
where effective_date between to_date('01-JAN-2001','DD-MON-RRRR') and to_date('31-DEC-2001','DD-MON-RRRR')
group by plan, effective_date
/
 COUNT(ID) P EFFECTIVE
---------- - ---------
         1 A 31-JAN-01
         1 A 28-FEB-01
         1 A 31-MAR-01
         2 A 30-APR-01
         2 A 31-MAY-01
         2 A 30-JUN-01
         1 A 31-JUL-01
         1 A 31-AUG-01
         1 A 30-SEP-01
         1 A 31-OCT-01
         1 A 30-NOV-01
         1 A 31-DEC-01
         1 B 31-JUL-01
         1 B 31-AUG-01
         2 B 30-SEP-01
         2 B 31-OCT-01
         2 B 30-NOV-01
         2 B 31-DEC-01
         1 C 31-OCT-01
         1 C 30-NOV-01
         1 C 31-DEC-01
This gives me data I can use to create my PivotTable, so that part is not a problem.
I use the plan in effect at the end of each month to determine which rows to include, so if a person moves between plans within a month, I only count the plan they are in as of the last day of that month.
If a person leaves a plan during the month and doesn't enter a new plan, I do not count them towards that month.
I have a feeling that I am missing something here and that there is a far more elegant solution to be found to this problem.
One issue I have spotted is that if a person entered a plan more than 24 months ago, they will drop off the count because I run out of months: my ADD_MONTHS on the start date of the plan does not generate enough months to reach the present date.
I have not found a solution to this yet, other than using more than 24 months in the query, or working backwards from the present date (which I am working on). :(
I am also thinking about discarding the analytics and working from the raw data instead, as I do not necessarily need to know the earliest start and latest end dates for the plan/person.
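For what it's worth, here is the sort of raw-data query I have in mind (an untested sketch; it reuses the CUBE row-generator trick from the view above and assumes I only want month-end dates for the last 24 complete months, working back from today):
SELECT COUNT(rt.id) head_count
     , rt.plan
     , months.month_end effective_date
FROM range_test rt
   , (-- last day of each of the previous 24 months, newest first
      SELECT ADD_MONTHS(TRUNC(SYSDATE,'MM'), 1 - ROWNUM) - 1 month_end
      FROM (SELECT NULL FROM dual GROUP BY CUBE(1,1,1,1,1))
      WHERE ROWNUM <= 24) months
-- count the plan that is in effect on the last day of the month
WHERE months.month_end BETWEEN rt.d1 AND rt.d2
GROUP BY rt.plan, months.month_end
ORDER BY rt.plan, months.month_end
/
Because the ranges for an id never overlap, each id matches at most one row per month-end date, so I shouldn't need the collapsed ranges at all for this count.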
Any assistance or hints would be greatly appreciated.
Regards,
Maria.
p.s. The word wrap has killed the formatting :(