I'm trying to use match_recognize() to find customers who made purchases on 10+ consecutive days.
A "day" here means a calendar date. For example, if customer 1 made 2 purchases on 10-MAY-2024 at 13:00 and 14:00, this would not be 2 consecutive days; it would be considered 1 day. Whereas if customer 1 made a purchase on 10-MAY-2024 at 23:59:59 and another on 11-MAY-2024 at 00:00:00, this would be considered 2 consecutive days, since the calendar date has changed even though the second purchase is less than 24 hours after the first.
Based on my test case and sample data below, my query returns the following streak, and I am unsure why:

CUSTOMER_ID FIRST_NAME LAST_NAME START_DATE  END_DATE    CONSECUTIVE_DAYS
----------- ---------- --------- ----------- ----------- ----------------
          2 Jane       Smith     15-JAN-2023 20-JAN-2023                6
As you can see, this is only 6 consecutive days, not 10 or more, so I expected match_recognize() to have filtered it out. Is this something match_recognize() can detect? If so, how? If not, can you suggest a workaround?
ALTER SESSION SET NLS_TIMESTAMP_FORMAT = 'DD-MON-YYYY';
CREATE TABLE customers
(CUSTOMER_ID, FIRST_NAME, LAST_NAME) AS
SELECT 1, 'Ann', 'Aaron' FROM DUAL UNION ALL
SELECT 2, 'Jane', 'Smith' FROM DUAL UNION ALL
SELECT 3, 'Bonnie', 'Winterbottom' FROM DUAL UNION ALL
SELECT 4, 'Sandy', 'Herring' FROM DUAL UNION ALL
SELECT 5, 'Roz', 'Doyle' FROM DUAL;
create table purchases(
ORDER_ID NUMBER GENERATED BY DEFAULT AS IDENTITY (START WITH 1) NOT NULL,
customer_id number,
PRODUCT_ID NUMBER,
QUANTITY NUMBER,
purchase_date timestamp
);
insert into purchases (customer_id, product_id, quantity, purchase_date)
select 2 customer_id, 102 product_id, 2 quantity,
TIMESTAMP '2024-04-03 00:00:00' + INTERVAL '18' HOUR + ((LEVEL-1) * INTERVAL '1 00:00:01' DAY TO SECOND) * -1 + ((LEVEL-1) * interval '0.007125' second)
as purchase_date
from dual
connect by level <= 15 UNION all
select 1, 101, 1,
DATE '2024-03-08' + INTERVAL '14' HOUR + ((LEVEL-1) * INTERVAL '1 00:00:00' DAY TO SECOND) * -1
from dual
connect by level <= 5 UNION ALL
select 3, 103, 3,
DATE '2024-02-08' + INTERVAL '15' HOUR + ((LEVEL-1) * INTERVAL '0 23:59:59' DAY TO SECOND) * -1
from dual
connect by level <= 5
UNION all
select 2, 102,1, date '2023-07-29' + level * interval '1' day from dual
connect by level <= 12
union all
select 2, 103,1, date '2023-08-29' + level * interval '1' day from dual
connect by level <= 15
union all
select 2, 104,1, date '2023-11-11' + level * interval '1' day from dual
connect by level <= 9
union all
select 4, 103,(3*LEVEL), TIMESTAMP '2023-06-01 05:18:03' + numtodsinterval ( (LEVEL -1) * 1, 'day' ) + numtodsinterval ( LEVEL * 37, 'minute' ) + numtodsinterval ( LEVEL * 3, 'second' ) FROM dual
CONNECT BY LEVEL <= 4 UNION ALL
SELECT 3, 102, 4,TIMESTAMP '2022-12-22 21:44:35' + NUMTODSINTERVAL ( LEVEL * 3, 'DAY') FROM dual
CONNECT BY LEVEL <= 13 UNION ALL
select 1, 104, (2 * LEVEL), date '2023-07-02' + level * interval '1 15:13' day to minute from dual
connect by level <= 7
union all
select 1, 101,3, date '2023-03-29' + level * interval '2' day from dual
connect by level <= 12
union all
select 2, 101,2, date '2023-01-15' + level * interval '8' hour from dual
connect by level <= 15
union all
select 2, 102,2,date '2023-04-13' + level * interval '1 1' day to hour from dual
connect by level <= 11
union all
select 3, 101,2, date '2023-02-01' + level * interval '1 05:03' day to minute from dual
connect by level <= 10
union all
select 3, 101,1, date '2023-04-22' + level * interval '23' hour from dual
connect by level <= 23
union all
select 3, 100,1, date '2022-03-01' + level * interval '1 00:23:05' day to second from dual
connect by level <= 15
union all
select 3, 101,1, date '2022-03-01' + level * interval '1 00:23:05' day to second from dual
connect by level <= 15
union all
select 4, 102,1, date '2023-01-01' + level * interval '5' hour from dual
connect by level <= 60;
select m.customer_id,
c.first_name,
c.last_name,
m.first_date,
m.last_date,
trunc(m.last_date) - trunc(m.first_date) + 1 as consecutive_days
FROM purchases pur match_recognize
(
partition by customer_id
order by purchase_date
measures
first(purchase_date) as first_date,
last(purchase_date) as last_date
one row per match
pattern(start_date p{9,})
define p as trunc(purchase_date) <= prev(trunc(purchase_date)) + interval '1' day
) m
LEFT OUTER JOIN customers c ON c.customer_id = m.customer_id;
Pattern matching counts rows, not days. Add a count of the matched rows and we can see that there are 15 rows in the first match:
select m.customer_id,
c.first_name,
c.last_name,
trunc(m.last_date) - trunc(m.first_date) + 1 as consecutive_days,
rws
from purchases pur match_recognize (
partition by customer_id
order by purchase_date
measures
first(purchase_date) as first_date,
last(purchase_date) as last_date,
count(*) as rws
one row per match
pattern(start_date p{9,})
define p as trunc(purchase_date) <= prev(trunc(purchase_date)) + interval '1' day
) m
left outer join customers c on c.customer_id = m.customer_id;
CUSTOMER_ID FIRST_ LAST_NAME CONSECUTIVE_DAYS RWS
----------- ------ ------------ ---------------- ----------
2 Jane Smith 6 15
2 Jane Smith 11 11
2 Jane Smith 12 12
2 Jane Smith 15 15
2 Jane Smith 15 15
3 Bonnie Winterbottom 15 30
3 Bonnie Winterbottom 23 23
4 Sandy Herring 13 60
This satisfies the pattern: one row followed by nine or more rows, each falling on the same calendar day as the previous row or the day after it. Rows on the same day all count toward the row total, which is how 15 rows can span only 6 distinct days.
There are various ways you can overcome this. One is to add a WHERE clause at the bottom of the query checking that at least ten days have elapsed between the first and last rows of the match.
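For example, a sketch based on your original query (the pattern still matches the same row runs; the WHERE clause just filters out matches spanning fewer than 10 calendar days):

select m.customer_id,
       c.first_name,
       c.last_name,
       m.first_date,
       m.last_date,
       trunc(m.last_date) - trunc(m.first_date) + 1 as consecutive_days
from   purchases pur match_recognize (
         partition by customer_id
         order by purchase_date
         measures
           first(purchase_date) as first_date,
           last(purchase_date)  as last_date
         one row per match
         pattern ( start_date p{9,} )
         define p as trunc(purchase_date) <= prev(trunc(purchase_date)) + interval '1' day
       ) m
left   outer join customers c on c.customer_id = m.customer_id
where  trunc(m.last_date) - trunc(m.first_date) + 1 >= 10;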
Another is to group the rows by day first. This gives you a 1:1 correspondence between rows and days, so you can check that there are at least 10 consecutive rows:
alter session set nls_date_format = 'DD-MON-YYYY';
with dys as (
select customer_id, trunc ( purchase_date ) dy from purchases
group by customer_id, trunc ( purchase_date )
)
select * from dys match_recognize (
partition by customer_id
order by dy
measures
first(dy) as first_date,
last(dy) as last_date,
count(*) as consecutive_days
pattern ( init consecutive{9,} )
define
consecutive as dy = prev ( dy ) + 1
) join customers
using ( customer_id );
CUSTOMER_ID FIRST_DATE LAST_DATE CONSECUTIVE_DAYS FIRST_ LAST_NAME
----------- ----------- ----------- ---------------- ------ ------------
2 14-APR-2023 24-APR-2023 11 Jane Smith
2 30-JUL-2023 10-AUG-2023 12 Jane Smith
2 30-AUG-2023 13-SEP-2023 15 Jane Smith
2 20-MAR-2024 03-APR-2024 15 Jane Smith
3 02-MAR-2022 16-MAR-2022 15 Bonnie Winterbottom
3 22-APR-2023 14-MAY-2023 23 Bonnie Winterbottom
4 01-JAN-2023 13-JAN-2023 13 Sandy Herring
The upside of grouping first is that pattern matching works on a smaller data set, particularly if customers make many purchases per day.
The downside is that it no longer accounts for time of day. So if the first purchase is on 10-MAY-2024 at 00:00:00 and the second on 11-MAY-2024 at 23:59:59, these are considered consecutive days, even though nearly 48 hours separate them. For the requirement in the question, which is defined purely on calendar dates, that is exactly the behaviour you want.
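If you did need to account for elapsed time as well, one option (a sketch only, using one possible definition: each day's earliest purchase must fall within 24 hours of the previous day's earliest purchase) is to carry the minimum timestamp per day through the grouping and compare it in the DEFINE clause:

with dys as (
  select customer_id,
         trunc(purchase_date) as dy,
         min(purchase_date)   as first_ts  -- earliest purchase that day
  from   purchases
  group  by customer_id, trunc(purchase_date)
)
select * from dys match_recognize (
  partition by customer_id
  order by dy
  measures
    first(dy) as first_date,
    last(dy)  as last_date,
    count(*)  as consecutive_days
  pattern ( init consecutive{9,} )
  define
    consecutive as dy = prev(dy) + 1
                and first_ts < prev(first_ts) + interval '1' day
) join customers using (customer_id);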