Match_recognize in SQL is awesome. But does take a bit of explaining to understand as you say.
So here goes:
Imagine you've started running. You're keeping track of your progress in the following table:
create table t (
run_date date,
time_in_s int,
distance_in_miles int
);
insert into t values (date'2017-01-01', 420, 1);
insert into t values (date'2017-01-02', 2400, 5);
insert into t values (date'2017-01-03', 2430, 5);
insert into t values (date'2017-01-04', 2350, 5);
insert into t values (date'2017-01-05', 410, 1);
insert into t values (date'2017-01-06', 400, 1);
insert into t values (date'2017-01-08', 2300, 5);
insert into t values (date'2017-01-09', 425, 1);
insert into t values (date'2017-01-10', 422, 1);
commit;
select * from t;
RUN_DATE TIME_IN_S DISTANCE_IN_MILES
01-JAN-2017 00:00:00 420 1
02-JAN-2017 00:00:00 2,400 5
03-JAN-2017 00:00:00 2,430 5
04-JAN-2017 00:00:00 2,350 5
05-JAN-2017 00:00:00 410 1
06-JAN-2017 00:00:00 400 1
08-JAN-2017 00:00:00 2,300 5
09-JAN-2017 00:00:00 425 1
10-JAN-2017 00:00:00 422 1
You want to know how many days in a row you ran faster than the previous day. Pattern matching is ideal for this kind of problem. Let's see how.
Before we do that, let's start with the simplest thing possible: matching anything.
To do this look for a pattern of "anything". Next you need to define this variable. To always have a match, you need something that is always true. One way to do this is looking for rows where the run_date equals itself:
select * from t
match_recognize (
order by run_date
pattern ( anything )
define
anything as run_date = run_date
);
SQL Error: ORA-30732: table contains no user-visible columns
Hmmm. What's going on there?
The query has no columns to display!
To resolve this you either need to add:
- A measures clause specifying output columns
- A partition by clause. These columns are included in your output
- The "all rows per match" clause, which includes the source table's columns
Let's add measures with a couple of predefined functions:
- match_number() - Assigns a number to each row in the match. All rows in the same match will have the same value. This starts at 1 and increases for each new match.
- classifier() - Which pattern variable the current row matches
These are handy for debugging. In they go and you get:
select * from t
match_recognize (
order by run_date
measures
match_number() as mno,
classifier() as cls
pattern ( anything )
define
anything as run_date = run_date
);
MNO CLS
1 ANYTHING
2 ANYTHING
3 ANYTHING
4 ANYTHING
5 ANYTHING
6 ANYTHING
7 ANYTHING
8 ANYTHING
9 ANYTHING
So that gives 9 different matches with match numbers 1-9. All match the variable anything, so this is the classifier.
Why's that? Surely if you're matching "anything", it should all be one match?
Well the pattern is a regular expression. Anything on its own means match one row. To match zero or more rows, use the asterisk operator:
select * from t
match_recognize (
order by run_date
measures
match_number() as mno,
classifier() as cls
pattern ( anything* )
define
anything as run_date = run_date
);
MNO CLS
1 ANYTHING
Now you have just a single match. But where have all the other rows gone?!
By default, match_recognize only returns one row per match. There's only one match, so you only get one row. To see them all, specify "all rows per match":
select * from t
match_recognize (
order by run_date
measures
match_number() as mno,
classifier() as cls
all rows per match
pattern ( anything* )
define
anything as run_date = run_date
);
RUN_DATE MNO CLS TIME_IN_S DISTANCE_IN_MILES
01-JAN-2017 00:00:00 1 ANYTHING 420 1
02-JAN-2017 00:00:00 1 ANYTHING 2,400 5
03-JAN-2017 00:00:00 1 ANYTHING 2,430 5
04-JAN-2017 00:00:00 1 ANYTHING 2,350 5
05-JAN-2017 00:00:00 1 ANYTHING 410 1
06-JAN-2017 00:00:00 1 ANYTHING 400 1
08-JAN-2017 00:00:00 1 ANYTHING 2,300 5
09-JAN-2017 00:00:00 1 ANYTHING 425 1
10-JAN-2017 00:00:00 1 ANYTHING 422 1
Note doing this also shows you all the columns in the source table. So I'd recommend using "all rows per match" while you're developing your queries.
Finding Faster RunsOK, that's the basics. Let's move onto the original problem: Finding days you ran faster than the previous one.
The pattern you're looking for is one or more days where time_in_s is less than this value for the previous row. The patterns use regular expressions. The + is for one or more. So your pattern variable will be:
pattern ( faster+ )
To get the value for a column in the previous row, pass it to the prev() function. So the definition of this variable is:
define
faster as time_in_s < prev(time_in_s)
Plug this into your query and you get:
select * from t
match_recognize (
order by run_date
measures
match_number() as mno,
classifier() as cls
all rows per match
pattern ( faster+ )
define
faster as time_in_s < prev(time_in_s)
);
RUN_DATE MNO CLS TIME_IN_S DISTANCE_IN_MILES
04-JAN-2017 00:00:00 1 FASTER 2,350 5
05-JAN-2017 00:00:00 1 FASTER 410 1
06-JAN-2017 00:00:00 1 FASTER 400 1
09-JAN-2017 00:00:00 2 FASTER 425 1
10-JAN-2017 00:00:00 2 FASTER 422 1
So you've two periods where you had a series of days faster than the previous. Unsurprisingly when you ran one mile the day after a five mile run, you were quicker!
These probably aren't the results you want. It's more useful to know whether you were quicker when you ran the
same distance. To split the results up like this, add a "partition by distance_in_miles" clause:
select * from t
match_recognize (
partition by distance_in_miles
order by run_date
measures
match_number() as mno,
classifier() as cls
all rows per match
pattern ( faster+ )
define
faster as time_in_s < prev(time_in_s)
);
DISTANCE_IN_MILES RUN_DATE MNO CLS TIME_IN_S
1 05-JAN-2017 00:00:00 1 FASTER 410
1 06-JAN-2017 00:00:00 1 FASTER 400
1 10-JAN-2017 00:00:00 2 FASTER 422
5 04-JAN-2017 00:00:00 1 FASTER 2,350
5 08-JAN-2017 00:00:00 1 FASTER 2,300
So now you have three periods when you were faster over the same distance. Two for one mile and one for 5 miles. Note the match_number() resets back to 1 for the group of 5 miles.
This is good. But you probably want to include the day before you were faster in your results. i.e. your very first run or the last one you were slower. To get this, add an "anything" match again:
select * from t
match_recognize (
partition by distance_in_miles
order by run_date
measures
match_number() as mno,
classifier() as cls
all rows per match
pattern ( anything faster+ )
define
faster as time_in_s < prev(time_in_s)
);
DISTANCE_IN_MILES RUN_DATE MNO CLS TIME_IN_S
1 01-JAN-2017 00:00:00 1 ANYTHING 420
1 05-JAN-2017 00:00:00 1 FASTER 410
1 06-JAN-2017 00:00:00 1 FASTER 400
1 09-JAN-2017 00:00:00 2 ANYTHING 425
1 10-JAN-2017 00:00:00 2 FASTER 422
5 03-JAN-2017 00:00:00 1 ANYTHING 2,430
5 04-JAN-2017 00:00:00 1 FASTER 2,350
5 08-JAN-2017 00:00:00 1 FASTER 2,300
Note you don't actually need to define the "anything" variable! Also the rows now have different classifiers, depending on which variable they matched in the pattern clause.
OK, this is looking better. But in your final results you may want to see:
- The first and last days in each series of improvements
- The number of consecutive days you were faster
To get these, add them to the measures clause. The first() and last() functions will, as you might expect, return the first and last values in the match for the expression. You can get the number of rows in the match with your trusty old count(*):
alter session set nls_date_format = 'DD-MON-YYYY';
select * from t
match_recognize (
partition by distance_in_miles
order by run_date
measures
match_number() as mno,
classifier() as cls,
first(run_date) as frun_date,
last(run_date) as lrun_date,
count(*) as c
all rows per match
pattern ( anything faster+ )
define
faster as time_in_s < nvl(prev(time_in_s), time_in_s)
);
DISTANCE_IN_MILES RUN_DATE MNO CLS FRUN_DATE LRUN_DATE C TIME_IN_S
1 01-JAN-2017 1 ANYTHING 01-JAN-2017 01-JAN-2017 1 420
1 05-JAN-2017 1 FASTER 01-JAN-2017 05-JAN-2017 2 410
1 06-JAN-2017 1 FASTER 01-JAN-2017 06-JAN-2017 3 400
1 09-JAN-2017 2 ANYTHING 09-JAN-2017 09-JAN-2017 1 425
1 10-JAN-2017 2 FASTER 09-JAN-2017 10-JAN-2017 2 422
5 03-JAN-2017 1 ANYTHING 03-JAN-2017 03-JAN-2017 1 2,430
5 04-JAN-2017 1 FASTER 03-JAN-2017 04-JAN-2017 2 2,350
5 08-JAN-2017 1 FASTER 03-JAN-2017 08-JAN-2017 3 2,300
Note: count(*) and last() return the relevant values up to that point. i.e. the running count and current date. To overcome this and show the actual last values in the group, add the final keyword before them:
select * from t
match_recognize (
partition by distance_in_miles
order by run_date
measures
match_number() as mno,
classifier() as cls,
first(run_date) as frun_date,
final last(run_date) as lrun_date,
final count(*) as c
all rows per match
pattern ( anything faster+ )
define
faster as time_in_s < nvl(prev(time_in_s), time_in_s)
);
DISTANCE_IN_MILES RUN_DATE MNO CLS FRUN_DATE LRUN_DATE C TIME_IN_S
1 01-JAN-2017 1 ANYTHING 01-JAN-2017 06-JAN-2017 3 420
1 05-JAN-2017 1 FASTER 01-JAN-2017 06-JAN-2017 3 410
1 06-JAN-2017 1 FASTER 01-JAN-2017 06-JAN-2017 3 400
1 09-JAN-2017 2 ANYTHING 09-JAN-2017 10-JAN-2017 2 425
1 10-JAN-2017 2 FASTER 09-JAN-2017 10-JAN-2017 2 422
5 03-JAN-2017 1 ANYTHING 03-JAN-2017 08-JAN-2017 3 2,430
5 04-JAN-2017 1 FASTER 03-JAN-2017 08-JAN-2017 3 2,350
5 08-JAN-2017 1 FASTER 03-JAN-2017 08-JAN-2017 3 2,300
At this point you're nearly done!
If you only want one row/series of faster days, remove the "all rows per match" clause. You can also take out the "final" keywords before last and count. This is because "one row per match" gives the final value:
select * from t
match_recognize (
partition by distance_in_miles
order by run_date
measures
match_number() as mno,
classifier() as cls,
first(run_date) as frun_date,
last(run_date) as lrun_date,
count(*) as c
pattern ( anything faster+ )
define
faster as time_in_s < nvl(prev(time_in_s), time_in_s)
);
DISTANCE_IN_MILES MNO CLS FRUN_DATE LRUN_DATE C
1 1 FASTER 01-JAN-2017 06-JAN-2017 3
1 2 FASTER 09-JAN-2017 10-JAN-2017 2
5 1 FASTER 03-JAN-2017 08-JAN-2017 3
Hopefully that gives you enough of the basics to get started :)
Pattern matching is a fantastic feature that makes it easy to solve problems that were previously thorny using SQL. For example, bin packing style problems:
https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:9533067800346375228 If you need more examples, check out these by Tim Hall:
https://oracle-base.com/articles/12c/pattern-matching-in-oracle-database-12cr1 And some from Tom Kyte at:
http://www.oracle.com/technetwork/issue-archive/2013/13-nov/o63asktom-2034271.html You've already mentioned Keith Laker's match_recognize series. Here's the link to the first in the series for those interested:
http://oracle-big-data.blogspot.co.uk/2016/03/sql-pattern-matching-deep-dive-part-1.html NOTE: Edited explanation of SQL Error: ORA-30732 to include all ways to resolve this.