Finding purchases on 10+ consecutive days

Question and Answer

Connor McDonald

Thanks for the question, Paul.

Asked: May 09, 2024 - 9:46 am UTC

Last updated: May 10, 2024 - 10:32 am UTC

Version: 19c

Viewed 1000+ times

You Asked

I'm trying to use match_recognize() to find purchases made by each customer on 10+ consecutive days.

A day here means the calendar date. For example, if customer 1 made 2 purchases on 10-MAY-2024 at 13:00 and 14:00, this would not be 2 consecutive days; it would be considered 1 day. Whereas if customer 1 made a purchase on 10-MAY-2024 at 23:59:59 and another on 11-MAY-2024 at 00:00:00, this would be considered 2 consecutive days since the calendar date has changed, although it's not 24 hours after the first purchase on 10-MAY-2024 at 23:59:59.

Based on my test case and sample data below, I appear to be finding the following streak of days, and I am unsure why:

CUSTOMER_ID FIRST_NAME LAST_NAME START_DATE  END_DATE    CONSECUTIVE_DAYS
          2 Jane       Smith     15-JAN-2023 20-JAN-2023                6

As you can see, this is only 6 consecutive days, not 10 or more, so I thought match_recognize() would have filtered it out. Is this something match_recognize can detect? If so, how? If not, can you suggest a workaround?


ALTER SESSION SET NLS_TIMESTAMP_FORMAT = 'DD-MON-YYYY';

CREATE TABLE customers 
(CUSTOMER_ID, FIRST_NAME, LAST_NAME) AS
SELECT 1, 'Ann', 'Aaron' FROM DUAL UNION ALL
SELECT 2, 'Jane', 'Smith' FROM DUAL UNION ALL
SELECT 3, 'Bonnie', 'Winterbottom' FROM DUAL UNION ALL
SELECT 4, 'Sandy', 'Herring' FROM DUAL UNION ALL
SELECT 5, 'Roz', 'Doyle' FROM DUAL;

create table purchases(
  ORDER_ID NUMBER GENERATED BY DEFAULT AS IDENTITY (START WITH 1) NOT NULL,
  customer_id   number, 
  PRODUCT_ID NUMBER, 
  QUANTITY NUMBER, 
  purchase_date timestamp
);

insert into purchases (customer_id, product_id, quantity, purchase_date)
    select  2 customer_id, 102 product_id, 2 quantity,
   TIMESTAMP '2024-04-03 00:00:00' + INTERVAL '18' HOUR + ((LEVEL-1) * INTERVAL '1 00:00:01' DAY TO SECOND)  * -1 +   ((LEVEL-1) * interval '0.007125' second) 
           as purchase_date
    from    dual
    connect by level <= 15 UNION all
    select  1, 101, 1,
          DATE '2024-03-08' + INTERVAL '14' HOUR + ((LEVEL-1) * INTERVAL '1 00:00:00' DAY TO SECOND) * -1
   from    dual
   connect by level <= 5 UNION ALL 
select  3, 103, 3,
          DATE '2024-02-08' + INTERVAL '15' HOUR + ((LEVEL-1) * INTERVAL '0 23:59:59' DAY TO SECOND) * -1
   from    dual
   connect by level <= 5
 UNION all
select 2, 102,1, date '2023-07-29' + level * interval '1' day from dual
          connect by level <= 12
union all
select 2, 103,1, date '2023-08-29' + level * interval '1' day from dual
          connect by level <= 15
union all
select 2, 104,1, date '2023-11-11' + level * interval '1' day from dual
          connect by level <= 9
union all
select 4, 103,(3*LEVEL), TIMESTAMP '2023-06-01 05:18:03' + numtodsinterval ( (LEVEL -1) * 1, 'day' ) + numtodsinterval ( LEVEL * 37, 'minute' ) +  numtodsinterval ( LEVEL * 3, 'second' ) FROM    dual
CONNECT BY  LEVEL <= 4 UNION ALL 
SELECT 3, 102, 4,TIMESTAMP '2022-12-22 21:44:35' + NUMTODSINTERVAL ( LEVEL * 3, 'DAY') FROM    dual
CONNECT BY  LEVEL <= 13 UNION ALL 
select 1, 104, (2 * LEVEL), date '2023-07-02' + level * interval '1 15:13' day to minute from dual
          connect by level <= 7
union all
select 1, 101,3, date '2023-03-29' + level * interval '2' day from dual
          connect by level <= 12
union all
select 2, 101,2, date '2023-01-15' + level * interval '8' hour from dual
          connect by level <= 15
union all
select 2, 102,2,date '2023-04-13' + level * interval '1 1' day to hour from dual
          connect by level <= 11
union all
select 3, 101,2, date '2023-02-01' + level * interval '1 05:03' day to minute from dual
          connect by level <= 10
union all
select 3, 101,1, date '2023-04-22' + level * interval '23' hour from dual
          connect by level <= 23
union all
select 3, 100,1,  date '2022-03-01' + level * interval '1 00:23:05' day to second from dual
          connect by level <= 15
union all
select 3, 101,1,  date '2022-03-01' + level * interval '1 00:23:05' day to second from dual
          connect by level <= 15
union all
select 4, 102,1, date '2023-01-01' + level * interval '5' hour from dual
          connect by level <= 60;

select m.customer_id,
       c.first_name,
       c.last_name,
       m.first_date,
       m.last_date,
       trunc(m.last_date) - trunc(m.first_date) + 1 as consecutive_days
from   purchases pur match_recognize (
         partition by customer_id
         order by purchase_date
         measures
           first(purchase_date) as first_date,
           last(purchase_date)  as last_date
         one row per match
         pattern (start_date p{9,})
         define p as trunc(purchase_date) <= prev(trunc(purchase_date)) + interval '1' day
       ) m
left outer join customers c on c.customer_id = m.customer_id;

and Chris said...

Pattern matching counts rows - not days. Add a count of the rows matched and we can see that there are 15 rows in the first match:

select m.customer_id,  
    c.first_name,
    c.last_name,
    trunc(m.last_date) - trunc(m.first_date) + 1 as consecutive_days, 
    rws
from purchases pur match_recognize (
    partition by customer_id
    order by purchase_date
    measures 
      first(purchase_date) as first_date, 
      last(purchase_date) as last_date,
      count(*) as rws
     one row per match                
   pattern(start_date p{9,}) 
   define p as trunc(purchase_date) <= prev(trunc(purchase_date)) + interval '1' day 
) m 
left outer join customers c on c.customer_id = m.customer_id;

CUSTOMER_ID FIRST_ LAST_NAME    CONSECUTIVE_DAYS        RWS
----------- ------ ------------ ---------------- ----------
          2 Jane   Smith                       6         15
          2 Jane   Smith                      11         11
          2 Jane   Smith                      12         12
          2 Jane   Smith                      15         15
          2 Jane   Smith                      15         15
          3 Bonnie Winterbottom               15         30
          3 Bonnie Winterbottom               23         23
          4 Sandy  Herring                    13         60


This satisfies the pattern - one row followed by nine or more rows, each on the same calendar day as the previous row or the day after.

There are various ways you can overcome this. One is to add a where clause to the bottom of the query checking that at least ten days have elapsed.
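For example, a sketch of that filter reusing the original pattern (the >= 10 check on the day span is the only addition):

-- keep only matches that span at least 10 calendar days
select m.customer_id,
       trunc(m.last_date) - trunc(m.first_date) + 1 as consecutive_days
from   purchases match_recognize (
         partition by customer_id
         order by purchase_date
         measures
           first(purchase_date) as first_date,
           last(purchase_date)  as last_date
         one row per match
         pattern (start_date p{9,})
         define p as trunc(purchase_date) <= prev(trunc(purchase_date)) + interval '1' day
       ) m
where  trunc(m.last_date) - trunc(m.first_date) + 1 >= 10;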

Another is to group the rows by day first. This means you have a 1:1 match between rows and days. So you can check there are at least 10 consecutive rows:

alter session set nls_date_format = 'DD-MON-YYYY';

with dys as ( 
  -- one row per customer per calendar day with a purchase
  select customer_id, trunc ( purchase_date ) dy from purchases
  group  by customer_id, trunc ( purchase_date ) 
)
select * from dys match_recognize (
  partition by customer_id
  order by dy
  measures 
    first(dy) as first_date, 
    last(dy) as last_date,
    count(*) as consecutive_days
  pattern ( init consecutive{9,} )
  define 
    consecutive as dy = prev ( dy ) + 1
) join customers 
using  ( customer_id );

CUSTOMER_ID FIRST_DATE  LAST_DATE   CONSECUTIVE_DAYS FIRST_ LAST_NAME   
----------- ----------- ----------- ---------------- ------ ------------
          2 14-APR-2023 24-APR-2023               11 Jane   Smith       
          2 30-JUL-2023 10-AUG-2023               12 Jane   Smith       
          2 30-AUG-2023 13-SEP-2023               15 Jane   Smith       
          2 20-MAR-2024 03-APR-2024               15 Jane   Smith       
          3 02-MAR-2022 16-MAR-2022               15 Bonnie Winterbottom
          3 22-APR-2023 14-MAY-2023               23 Bonnie Winterbottom
          4 01-JAN-2023 13-JAN-2023               13 Sandy  Herring


The upside of grouping first is that pattern matching works on a smaller data set, particularly if customers make many purchases per day.

The downside is that it no longer accounts for time of day. So if the first purchase is on 10-MAY-2024 at 00:00:00 and the second on 11-MAY-2024 at 23:59:59, these are considered consecutive days, even though there are more than 24 hours between them.
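If you did need a strict 24-hour rule instead of calendar dates, a sketch of how the ungrouped approach might compare the raw timestamps (note this is not the calendar-date rule the question asks for, and the row count would still need converting to a day count):

-- sketch only: rows are "consecutive" when within 24 hours of the previous purchase
select *
from   purchases match_recognize (
         partition by customer_id
         order by purchase_date
         measures
           first(purchase_date) as first_date,
           last(purchase_date)  as last_date,
           count(*)             as rws
         pattern (init p{9,})
         define p as purchase_date <= prev(purchase_date) + interval '1' day
       );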


Comments

It is possible to do everything in one MATCH_RECOGNIZE pass

mathguy, May 09, 2024 - 7:19 pm UTC

If MATCH_RECOGNIZE itself is written more carefully, there will be no need for pre-processing (using an aggregation before MATCH_RECOGNIZE) or for post-processing (WHERE clause in the query).

This was pointed out to the OP on a different site (Oracle Forums) two days ago. He posted the same question there; the MATCH_RECOGNIZE solution he posted here on AskTom is not his solution, it's the answer he accepted on the Oracle Forum. The author of that solution showed the correction (using a WHERE clause) two days ago. I posted the solution below, also two days ago. Instead of continuing the discussion there, he chose to come here and not even mention the work done on the Oracle Forum. Totally uncool.

As I mentioned there, I left out the final join (to pick up customer details) since that is not really related to the question; it's the same in any solution, and it's easy.

select customer_id, first_date, last_date, last_date - first_date + 1 as consecutive_days
from   purchases
       match_recognize(
         partition by customer_id
         order     by purchase_date
         measures  trunc(first(purchase_date)) as first_date,
                   trunc(last(purchase_date))  as last_date
         pattern   (f s* (n s*){9,})
         subset    x = (f, n)
         define    s as trunc(purchase_date) = prev(trunc(purchase_date)),
                   n as trunc(purchase_date) = prev(trunc(x.purchase_date)) + 1
       )
;


The classifiers are F for First (row in a match), N for the first row on the Next date, and S for additional rows on the Same date. The pattern looks for one or more rows on a First date (for a match), followed by at least nine more subsets of rows on Next dates, as required. All Same-date rows are included, in case the OP needs the count, or needs ALL ROWS PER MATCH, or needs to aggregate other things (total amount purchased, etc.).

Chris Saxon
May 10, 2024 - 10:32 am UTC

Thanks for sharing and giving the background.

Note that the pattern (f s* (n s*){9,}) could lead to a lot of backtracking in the regular expression. Depending on the data this could be incredibly slow.

Catastrophic backtracking?

mathguy, May 10, 2024 - 4:08 pm UTC

"Note that the pattern (f s* (n s*){9,}) could lead to a lot of backtracking in the regular expression. Depending on the data this could be incredibly slow."

Are you thinking about "catastrophic backtracking"? If so, then no - that pattern does not lead to catastrophic backtracking, even though there are nested quantifiers.

Catastrophic backtracking can happen when the same sequence of characters (or of rows, for MATCH_RECOGNIZE) can be a "partial match" to the regexp in more than one way - then the number of potential matches to be followed grows exponentially. This is not the case in my solution; a row can be either S or N, but it can't be both, and there are no alternations in the pattern. A sequence of rows can be a partial match in at most one way, no more. There will be a bit of backtracking as there is with these kinds of patterns, but not the catastrophic kind that may lead to "incredibly slow".

There is one improvement that I can make (and it applies also to "the OP's" solution): the classifier F can also be defined explicitly as a non-N, non-S row (that is, F is either the first row in a partition, or a row two or more days later than the previous row, using the OP's definition based on calendar date only, ignoring time-of-day). The way I wrote the solution, after a failed match the next attempt starts at the immediately following row, even if that row is not "two or more days later". The point being: if a row was classified as N or S in an earlier attempt and that attempt failed, that row can't be an F in a successful match starting there. That can be added easily and it will improve performance, but not drastically.
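For illustration, a sketch of that refinement (my reading of the suggestion; the explicit F condition is the only change from the query above):

select customer_id, first_date, last_date, last_date - first_date + 1 as consecutive_days
from   purchases
       match_recognize(
         partition by customer_id
         order     by purchase_date
         measures  trunc(first(purchase_date)) as first_date,
                   trunc(last(purchase_date))  as last_date
         pattern   (f s* (n s*){9,})
         subset    x = (f, n)
         define    -- F: first row of the partition, or a gap of 2+ calendar days from the previous row
                   f as prev(purchase_date) is null
                        or trunc(purchase_date) >= prev(trunc(purchase_date)) + 2,
                   s as trunc(purchase_date) = prev(trunc(purchase_date)),
                   n as trunc(purchase_date) = prev(trunc(x.purchase_date)) + 1
       )
;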

Speed testing - aggregating by days first is fastest

mathguy, May 13, 2024 - 3:29 pm UTC

I did some testing on two main cases. I tested "my" solution, "the OP's" solution with the WHERE clause added, and Chris's solution where purchases are aggregated by days first, and then MATCH_RECOGNIZE only needs to chase after dates (rather than through the individual rows of each partition). In all cases, I added a definition in the DEFINE clause of MATCH_RECOGNIZE so that after a failed matching attempt at one row, the search doesn't resume from the immediately following row (I explained this business in my previous reply).

In all cases: my solution is slightly slower than the OP's solution with WHERE clause added. This seems to be because for my solution, the Oracle implementation of MATCH_RECOGNIZE chooses a non-deterministic finite automaton to search for matches. This is sub-optimal, but perhaps I ask too much of MATCH_RECOGNIZE. In any case, even with this deadweight, "my" solution takes only about 20% longer to complete in the "easy" cases, and about 80% longer to complete in the hardest case, which I will describe below. Two conclusions here: First, as I suspected, even in the hardest cases, "my" solution doesn't take forever - it does not cause catastrophic backtracking. Second, even though in theory "my" solution should be more efficient (based just on math reasoning), in practice we must also consider how the solution will be implemented. If for whatever reason Oracle won't use a DFA for "my" solution, then it won't be fast in practice, even though it should be in theory. (I wonder if there is a hint or some other method to tell the optimizer to use a DFA when it won't choose one on its own.)

IN ANY CASE: the "aggregate first, then MATCH_RECOGNIZE the intermediate output" approach is faster in all tests; about twice as fast as the other solutions (which don't aggregate by date first) in the easy cases, and about three times as fast as the OP's solution and five times as fast as "my" solution in the hard case. That's the solution the OP should use, if he has large amounts of data.

Here is the "hard" test I used. Regular expressions have the most difficulty with inputs where the pattern "almost" matches but in the end it fails. So I set up a case where each month, for 30 different months, a customer makes 20 purchases per day on the first nine days of the month, and then nothing for the rest of the month. As before - I ignored customer first and last name (using just the PURCHASES table). I deleted everything from the table and inserted as shown below. The input table has 270,000 rows; "my" solution, which is the slowest, completes in 0.65 seconds on my machine, so for this amount of data speed may not even matter. (In my comments I didn't include the actual times I got, only the relative performance, because the actual times will vary a lot depending on hardware, etc.)

insert into purchases (customer_id, product_id, quantity, purchase_date)
with
  c (cid) as (select level from dual connect by level <= 50),  -- 50 customers
  h (hr)  as (select level from dual connect by level <= 20),  -- 20 purchases per day
  d (dy)  as (select level from dual connect by level <=  9),  -- first 9 days of each month
  m (mth) as (select level from dual connect by level <= 30)   -- 30 months
select cid as customer_id,
       cid + hr + dy + mth as product_id,
       hr * dy as quantity,
       add_months(date '2020-01-01', mth) + dy + hr/24 + cid/1440 as purchase_date
from   c, h, d, m
;
