This sprang out of the following post:
</code>
http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:474221407101 <code>
you asked for more info and it started to look like an entirely new question so I have posted it as such. Here we go...
[My first Post]
I need to load lots of data very quickly from an external table the data looks
like:
id,value,from_date,to_date
I need to load this data subject to the following requirements:
1) There can be no overlaps for a particular id (so one records to_date is equal
to the next records from_date.
2) There must be a full set of data from a given historic date date (roughly
three years in the past).
3) All the data for each id must span the period from the historic date as close
to the current date as possible (given the data available).
For example:
If historic date is 01/01/2003 (DD/MM/YYYY), this is the data that I want to
load (I need to discard data that does not marry the previous records to_date to
the next records from date:
id value from_date to_date
1 10 01/01/2003 01/02/2003
1 12 01/02/2003 21/05/2003
1 19 21/05/2003 01/06/2003
1 25 01/06/2003 01/12/2003
1 25 01/12/2003 01/02/2004
How would one generate a select to pull out data that marries like this?
Also, I need to constrain the schema so that only the data that satifies these
rules will be accepted (similar to the original post). Does the method described
in this post still apply to 9.2 or is there a better way?
Best Regards,
[END]
[Your Response]
tell me, what if the inputs do have overlaps. what logic decides what rows are
actually kept (and why)
what does "there must be a full set of data" mean actually?
[END]
[My final post, with some additional information provided to the original]
(I'll answer your Q's in a sec).
I'm working at this problem from two ends.
1) a query to pull out the valid data from the external table subject to my
requirements in order to create a table
2) a loader to go through the data that was not retreived by the query. The
loader will bulk load these values and constraints will identify what the
problem was with the data (so it can be investigated and fixed)
So, I think I need
a) A suitable query to gather the data according to the requirements (more
detail will follow).
b) A mechanism to constrain the target table so that 'bad' data cannot get in
(and a constraint or trigger will raise an error that may be caught and logged)
The logic for keeping records is this:
a) The record with the earliest from_date for a particular id must have a
from_date (GREATER THEN OR EQUAL TO) 01/01/2003 (DD/MM/YYYY) (the historic date in the example above).
b) The set of records for a particular id must completely span the period (with
no gaps - ie: it fishtails so that current from_date=previous to_date) that
starts on 01/01/2003 (the historic date) and continues up to the present.
However;
provided that the data fishtails throughout the period it is allowable for
the most recent record to finish on a date before the current date.
d) there can be no duplicate data (even if the rows are the same eg: two records
with the same from and to dates.)
e) no to_date can be strictly greater than the current date.
So, you can see (I hope) that the logic requires me to valiodate "sets" of rows
(defined by the id) such that the following are "sets" are accepted:
HistDate Now
| |
ID: 1 |-----|
ID: 1 |-----------|
ID: 1 |------|
ID: 1 |----------|
ID: 2 |---------|
ID: 2 |---|
ID: 2 |-----|
ID: 2 |---------|
ID: 3 |---------------------------------|
ID: 4 |-----------------------------------|
ID: 5 |------------------|
ID: 5 |-|
and the following sets would be rejected:
HistDate Now
| |
ID: 6 |-----| <- before HistDate
ID: 7 |-----------| <- a gap
ID: 7 |------|
ID: 7 |------------|
ID: 8 |---------|
ID: 8 |-----| <- overlap
ID: 8 |-----|
ID: 8 |---------|
ID: 9 |-------------------------------------| <- greater than current date
ID: 9 |---------|
ID: 9 |--------------------|
ID: 9 |--------------------| <- duplicate
ID: 9 |----|
A "full set of data" means a consistent set of data for the id.
When loading the data I could order the loads by id and from_date to simplify
the validation. In that the data would be loaded in a (from_date) time ordered
fashion, which would satify the logic described above.
Best Regards,
[END]
"Is there only one possible "covering set" per id?"
-->Yes, there must be only one such covering set.
"what I mean -- is for id 1 -- does that represent ALL of the rows in the input set or did we have to filter out some/many of the rows with id=1 in order to find this nice covering set?"
-->Exactly. For id 1. you see the results of filtering out rows for id 1 that do not satisfy the described constraints. THe result is the covering set.
this is pretty easy with LAG and LEAD functionality. The following demonstrates the approach:
ops$tkyte@ORA920PC> select id, d1, d2, max(what) over (partition by id)
2 from (
3 select id, d1, d2, last_d1, last_d2, next_d1, next_d2,
4 case when d1 < to_date( '01-jan-2003' )
5 then 'Before Hist Date'
6 when d2 > trunc(sysdate)
7 then 'Greater than current date'
8 when max_d2 != trunc(sysdate)
9 then 'Not upto End date'
10 when d2 < next_d1
11 then 'Gap'
12 when d2 != next_d1
13 then 'Overlap'
14 end what
15 from ( select id, d1, d2,
16 lag(d1) over ( partition by id order by d1) last_d1,
17 lag(d2) over ( partition by id order by d2) last_d2,
18 lead(d1) over ( partition by id order by d1) next_d1,
19 lead(d2) over ( partition by id order by d2) next_d2,
20 max(d2) over (partition by id ) max_d2
21 from t
22 )
23 )
24 order by 1, 2, 3
25 /
ID D1 D2 MAX(WHAT)OVER(PARTITIONBY
---------- --------- --------- -------------------------
1 01-JAN-03 15-JAN-03
15-JAN-03 29-JAN-03
29-JAN-03 29-JAN-04
29-JAN-04 09-FEB-04
6 31-DEC-02 15-JAN-03 Before Hist Date
15-JAN-03 09-FEB-04 Before Hist Date
7 01-JAN-03 14-JAN-03 Gap
15-JAN-03 29-JAN-03 Gap
29-JAN-03 29-JAN-04 Gap
29-JAN-04 09-FEB-04 Gap
8 01-JAN-03 15-JAN-03 Overlap
15-JAN-03 29-JAN-03 Overlap
20-JAN-03 29-JAN-04 Overlap
29-JAN-04 09-FEB-04 Overlap
9 01-JAN-03 15-JAN-03 Not upto End date
15-JAN-03 29-JAN-03 Not upto End date
29-JAN-03 29-JAN-04 Not upto End date
29-JAN-04 10-FEB-04 Not upto End date
10 01-JAN-03 15-JAN-03 Overlap
15-JAN-03 29-JAN-03 Overlap
29-JAN-03 29-JAN-04 Overlap
29-JAN-03 29-JAN-04 Overlap
29-JAN-04 09-FEB-04 Overlap
11 01-JAN-03 15-JAN-03 Not upto End date
15-JAN-03 29-JAN-03 Not upto End date
29-JAN-03 29-JAN-04 Not upto End date
29-JAN-04 08-FEB-04 Not upto End date
27 rows selected.
I have not 100% completed the CASE statement -- you'll need to fine tune that -- but you see the idea...