Database, SQL and PL/SQL

On SQL, SQL, and More SQL

Our technologist says yes, yes, yes to interesting SQL solutions.

By Tom Kyte

March/April 2015

I was looking at my recent Ask Tom questions, figuring out what I was going to write about in this issue of Oracle Magazine, and I noticed a theme: short questions with interesting SQL solutions. SQL solutions that go a little bit outside the box.

Also, I’ve been on a “Yes SQL” kick recently, pointing out that SQL is an extremely powerful language and that with mastery of just a few features, you can do things in a highly scalable fashion with very little code.

So this column will be all about SQL, showcasing what I think are interesting applications of SQL functionality that you may either not be aware of or never thought of using in quite the way I do. In this limited space, I cannot explain every nuance of how the SQL works; rather, I’m hoping to whet your appetite and point you to where you can get more information.


Finding MIN/MAX Values

How can I create a query in Oracle Database 11g Release 2 to return the results shown? I need to find the date values at the beginning and end of the range for any sequence of consecutive rows—ordered by date—that have the same value for N. I have a feeling it involves analytic functions, but they hadn’t been invented when I took my only SQL course, in 1987!

drop table jr_tmp;
create table jr_tmp (d date, n number);
insert into jr_tmp (d, n)
values (date '2014-12-01', 1);
insert into jr_tmp (d, n)
values (date '2014-12-02', 1);
insert into jr_tmp (d, n)
values (date '2014-12-03', 1);
insert into jr_tmp (d, n)
values (date '2014-12-04', 2);
insert into jr_tmp (d, n)
values (date '2014-12-05', 2);
insert into jr_tmp (d, n)
values (date '2014-12-06', 1);
insert into jr_tmp (d, n)
values (date '2014-12-07', 1);
insert into jr_tmp (d, n)
values (date '2014-12-08', 1);
<My query here>
D_FROM    D_TO       N
————————  —————————— —————
01-DEC-14 03-DEC-14  1
04-DEC-14 05-DEC-14  2
06-DEC-14 08-DEC-14  1

The year 1987 is when I took my first SQL course. But I keep taking them, and they help me do some pretty amazing things. (And—just a thought—it might be time for a refresher!) For those who wonder what courses I take, they are all self-taught these days, and Ask Tom is my classroom. You’ll see an example of that education in this very Q&A, originally posted at bit.ly/atminmax.

There are various approaches I could take to answer this question, and the one I’ve chosen is a technique I’ve been using since analytic functions first became available, in Oracle8i Database Release 2 (a long, long time ago!). I call it my “carry down” technique. Here is the solution:

SQL> select min(d), max(d), n, grp2
  2    from (
  3  select d, n, last_value(grp ignore nulls) over (order by d) grp2
  4    from (
  5  select d, n,
  6    case when (nvl(lag(n) over (order by d),n-1) <> n) then
  7         row_number() over (order by d)
  8          end grp
  9    from jr_tmp
 10         )
 11         )
 12    group by n, grp2
 13   order by 1
 14  /
MIN(D)     MAX(D)     N     GRP2
—————————— —————————— ————  ————
01-DEC-14  03-DEC-14  1        1
04-DEC-14  05-DEC-14  2        4
06-DEC-14  08-DEC-14  1        6

To fully demonstrate it, I will run it piece by piece to show what is happening, starting with the innermost inline view. I start by “marking” the beginning of each group; I do this by using the LAG analytic function to look back one row in the result set to see if the prior row’s N value was different from the current row’s N value.

If the N value is different, I output a unique marker that denotes the beginning of a new group of records to find the min/max value:

SQL> select d, n,
  2    case when (nvl(lag(n) over (order by d),n-1) <> n) then
  3       row_number() over (order by d)
  4        end grp
  5    from jr_tmp
  6  /
D               N      GRP
—————————— ———————— ——————
01-DEC-14        1      1
02-DEC-14        1
03-DEC-14        1
04-DEC-14        2      4
05-DEC-14        2
06-DEC-14        1      6
07-DEC-14        1
08-DEC-14        1
8 rows selected.

Next, I “carry down” the GRP column I just created, so that all the records in the group have the same value. I use the LAST_VALUE analytic function with the IGNORE NULLS option to accomplish this:

SQL> select d, n, last_value(grp ignore nulls) over (order by d) grp2
  2    from (
  3  select d, n,
  4         case when (nvl(lag(n) over (order by d),n-1) <> n) then
  5              row_number() over (order by d)
  6          end grp
  7    from jr_tmp
  8         )
  9  /
D                   N     GRP2
—————————— —————————— ————————
01-DEC-14           1        1
02-DEC-14           1        1
03-DEC-14           1        1
04-DEC-14           2        4
05-DEC-14           2        4
06-DEC-14           1        6
07-DEC-14           1        6
08-DEC-14           1        6
8 rows selected.

Once I have that result set, finding the min/max values for like values of N after sorting by the date column is rather easy.

Now here is where the SQL learning comes in. Over time, as SQL evolves and grows, new approaches become available. Sometimes these new approaches are better-performing and more scalable, sometimes they are more flexible, and sometimes they are just easier to code. A new approach does not rule out the old one; it just gives you more tools to consider when attacking a problem.

In this case, shortly after I posted this Q&A, a frequent Ask Tom reader and contributor, Stew Ashton, posted an alternative approach that uses a new Oracle Database 12c capability. He wrote, “If your refresher course includes [Oracle Database] 12c, you should learn about the MATCH_RECOGNIZE clause!”

select * from jr_tmp
match_recognize(
  order by d
  measures first(d) d_from,
           last(d) d_to, n n
  pattern (a b*)
  define b as n = prev(n)
);
D_FROM     D_TO                N
—————————— —————————— ——————————
2014-12-01 2014-12-03          1
2014-12-04 2014-12-05          2
2014-12-06 2014-12-08          1

I knew about the MATCH_RECOGNIZE clause—I recently wrote about it in Oracle Magazine. If you read that article, you’ll get some insight into how this approach works. For this current question, if the person who originally posed it had been using Oracle Database 12c, he would likely have found using MATCH_RECOGNIZE to be a better-performing approach than using the multiple layers of analytics. My analytic approach required multiple passes through and sorts of the data, and as the result set grows, the TEMP resources required by the analytics could grow immensely, due to all the intermediate result sets. The row pattern matching MATCH_RECOGNIZE approach, in contrast, would require only one pass.
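
If you want to see that difference for yourself, one approach is to compare the execution plans. This is a minimal sketch, assuming you run each query first in the same session, because DBMS_XPLAN.DISPLAY_CURSOR reports on the last statement executed:

set serveroutput off
-- run the analytic query (or the MATCH_RECOGNIZE query) here, then:
select * from table(dbms_xplan.display_cursor);
-- For the analytic version, expect to see several sort steps (WINDOW SORT
-- operations plus a SORT GROUP BY); for the MATCH_RECOGNIZE version,
-- expect a single MATCH RECOGNIZE SORT step.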

But because Stew threw out the challenge, I decided to think about this problem again to see if there might be an alternative approach in Oracle Database 11g Release 2 that wasn’t available when I first developed this technique. Upon thinking about it, I found a feature that was introduced in Oracle Database 10g—the MODEL clause—and I used it to provide still another answer to this question, as shown in Listing 1.

Code Listing 1: The MODEL clause solution to “Finding MIN/MAX Values”

SQL> select min(d), max(d), n, grp
  2    from
  3  (
  4  select *
  5    from
  6  ( select d, n, grp, rn
  7      from jr_tmp
  8     model dimension by(row_number() over(order by d) rn)
  9     measures(d, n, 1 grp)
 10     rules(grp[rn=1] = 1,
 11           grp[rn > 1] =
 12             case when (n[cv() - 1] <> n[cv()])
 13                  then grp[cv()-1]+1
 14                  else grp[cv()-1]
 15              end
 16          )
 17  )
 18  )
 19  group by n, grp
 20  order by 1
 21  /
MIN(D)    MAX(D)          N     GRP
————————— —————————— —————— ———————
01-DEC-14 03-DEC-14       1       1
04-DEC-14 05-DEC-14       2       2
06-DEC-14 08-DEC-14       1       3

I got interested in the MODEL clause when someone followed up a complex answer I presented in response to a question with an elegant (and much more scalable) approach that uses pure SQL with the MODEL clause. You can read about that at bit.ly/atmodel.


Limiting the Records

Is it possible to limit the number of records retrieved in a query by using an analytic function? For example, I have

create table family (lastName varchar2(30),
firstName varchar2(30));
insert into family (lastName, firstName)
values ('Afam','Alan');
insert into family (lastName, firstName)
values ('Afam','Ben');
insert into family (lastName, firstName)
values ('Afam','Corey');
insert into family (lastName, firstName)
values ('Bfam','Anna');
insert into family (lastName, firstName)
values ('Bfam','Bobo');
insert into family (lastName, firstName)
values ('Cfam','Ava');
insert into family (lastName, firstName)
values ('Dfam','Anny');
insert into family (lastName, firstName)
values ('Dfam','Bertha');

And I need to be able to query the table and retrieve all the rows where lastName is in the first N occurrences. My idea was to produce a something_like_row_number() function so I can query as follows:

select * from
(select lastName,firstName,
something_like_row_number slr
from family)
where slr<=2;

This something_like_row_number function should start from 1 and increase for every new family:

Afam Alan 1
Afam Ben 1
Afam Corey 1
Bfam Anna 2
Bfam Bobo 2
—————
Cfam Ava 3
Dfam Anny 4
Dfam Bertha 4

Since I really don’t need the rest of my data, is there a way to stop my search right after slr>2?

The person who posed this question is looking for ranking functions. I’ve covered these in the past (bit.ly/attop-n) and decided to use them for this solution:

SQL> select *
  2    from ( select lastname, firstname,
  3           dense_rank() over (order by lastname) dr
  4             from family )
  5  /
LASTNAME  FIRSTNAME  DR
———————— —————————— ———
Afam      Alan        1
Afam      Ben         1
Afam      Corey       1
Bfam      Anna        2
Bfam      Bobo        2
Cfam      Ava         3
Dfam      Anny        4
Dfam      Bertha      4
8 rows selected.

This demonstrates the effect of the DENSE_RANK ranking function. It, in effect, assigns a “score” to each set of last names, and once those scores are assigned, the result becomes rather easy to filter down to just the two groups required:

SQL> select *
  2    from (select lastname, firstname,
  3                 dense_rank() over (order by lastname) dr
  4            from family )
  5   where dr <= 2
  6  /
LASTNAME         FIRSTNAME           DR
———————————————— ——————————————————— ———
Afam             Alan                 1
Afam             Ben                  1
Afam             Corey                1
Bfam             Anna                 2
Bfam             Bobo                 2

I’ve often said that analytics were the best thing to happen to SQL since the keyword SELECT. If you have never used them, you are missing out on one of the most powerful constructs available in SQL today. Analytic functions are part of the SQL standard; they are not a proprietary, nonportable solution. You can get a nice “SQL 101” introduction to them in the series of articles at bit.ly/sql101p10, bit.ly/sql101p11, and bit.ly/sql101p12.
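
As a quick aside, DENSE_RANK is one of three closely related ranking functions in the SQL standard. A minimal sketch against the FAMILY table above shows how they treat ties differently:

select lastname, firstname,
       row_number() over (order by lastname) rn,   -- always unique: 1 through 8, ties broken arbitrarily
       rank()       over (order by lastname) rnk,  -- ties share a value, gaps follow: 1,1,1,4,4,6,7,7
       dense_rank() over (order by lastname) dr    -- ties share a value, no gaps: 1,1,1,2,2,3,4,4
  from family
 order by lastname, firstname;

DENSE_RANK is the right choice here precisely because it leaves no gaps, so "dr <= 2" always means "the first two families."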


Distinct Count of a Value Across Columns in a Table

I have this table:

select * from country_test;
c1    c2    c3    c4
————— ————— ————— ——————
india us    china uk
india india china uk
india china china uk
us    us    us    uk 

I need the distinct count of countries across the c1,c2,c3,c4 columns of the table, so the output has to be

c1    c2    c3    c4   cnt
————— ————— ————— ——— ————
india us    china uk     4
india india china uk     3
india china china uk     3
us    us    us    uk     2

I found this Q&A on Ask Tom to be really fun—fun to write the initial answer and then even more fun to watch all the alternative approaches. You can see the entire progression at bit.ly/atdistc; I’ll showcase a few approaches here.

My initial answer was to use a Cartesian join to turn each of the four columns into four rows—that is, turn each individual row into four separate rows while giving each of these four new rows something “unique” so I could do a COUNT(DISTINCT) on that group. Here was my approach:

SQL> with data(r)
  2  as
  3  (select 1 r from dual
  4   union all
  5   select r+1 from data where r < 4
  6  )
  7  select c1, c2, c3, c4, count(distinct c) cnt
  8    from (
  9  select rowid rid,
 10         c1, c2, c3, c4,
 11         decode(r,1,c1,2,c2,3,c3,4,c4) c
 12    from data, country_test
 13         )
 14   group by rid, c1, c2, c3, c4
 15  /
C1       C2       C3       C4       CNT
———————— ———————— ———————— ————— ——————
india    us       china    uk         4
us       us       us       uk         2
india    india    china    uk         3
india    china    china    uk         3

This query works by first generating a set of four rows by using a recursive WITH subquery. After that I perform a Cartesian join of those four rows to the original table, turning each single row into four almost duplicate rows. ROWID would be duplicated, but the new column C could differ in these four rows. Next, using DECODE, I populate a single column in each of those rows—one of the four original columns. I keep the primary key (ROWID in this case) so I can safely group by (in case two rows have the same c1..c4 values). Last, all I have to do is group by the duplicate values to get the four rows back into their original one row—while performing the COUNT(DISTINCT) on the C column. (If you have never seen the recursive WITH subquery before, you might want to check out the “Recursive Subquery Factoring” section at bit.ly/atnewed.)
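
If the recursive WITH row generator is the unfamiliar part, here it is in isolation, a minimal sketch that manufactures four rows from nothing:

with data(r) as (
  select 1 from dual       -- anchor member: the first row
  union all
  select r + 1 from data   -- recursive member: one more row per pass
   where r < 4             -- stop condition: four rows in total
)
select r from data;        -- returns 1, 2, 3, 4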

Shortly after I posted this answer, in came a flood of alternatives. This is the part I really like, and this is where I myself learn new techniques. None of these were truly new to me—I’ve used them all before; I just didn’t think to use them in this case. Sometimes I can get stuck in a rut, doing something the way I first discovered 25 years ago.

Stew Ashton rang in with a nice technique:

select c1, c2, c3, c4, column_value cnt
from country_test,
table(cast(multiset(
  select count(distinct c1) from (
    select c1 from dual union all
    select c2 from dual union all
    select c3 from dual union all
    select c4 from dual
  )
) as sys.odcinumberlist));

That solution uses a technique available since Oracle8 Database: table unnesting. With the TABLE clause, each row in the original source table is “joined” to the nested table. In this case, each row in the COUNTRY_TEST table is joined to the single-row result set from the COUNT(DISTINCT) of the four columns. This is potentially a better approach for applications that want to get the first row as quickly as possible (optimizing for initial response time), because the database would not have to perform the Cartesian join and then aggregate results before returning a row. Rather, with table unnesting, you would be able to get a row returned immediately.
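
If you have not used the TABLE clause before, here is table unnesting at its simplest, a minimal sketch using the same SYS.ODCINUMBERLIST collection type:

-- Each element of the collection becomes one row; the pseudocolumn
-- COLUMN_VALUE holds the element's value.
select column_value
  from table( sys.odcinumberlist(10, 20, 30) );
-- returns three rows: 10, 20, 30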

My Cartesian join approach would likely be good for applications that want to get the last row as quickly as possible (optimizing for total throughput), because it would tend to utilize far less total CPU. You can prove this hypothesis by creating a large test table, running both queries to completion, and then using PL/SQL to define a cursor that fetches just the first row from each query. You would observe in a TKPROF report that the Cartesian join, in fact, gets to the last row much more quickly than table unnesting and uses significantly less CPU, whereas the table unnesting approach gets the first row instantly; the Cartesian join takes almost as long to return the first row as it does the last row.
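
Here is a sketch of that first-row test, assuming Stew's multiset query is the one under test: enable SQL trace, fetch exactly one row, and then examine the resulting trace file with TKPROF.

alter session set sql_trace = true;

declare
  cursor c is
    select c1, c2, c3, c4, column_value cnt
      from country_test,
           table(cast(multiset(
             select count(distinct c1) from (
               select c1 from dual union all
               select c2 from dual union all
               select c3 from dual union all
               select c4 from dual
             )
           ) as sys.odcinumberlist));
  l_rec c%rowtype;
begin
  open c;
  fetch c into l_rec;  -- fetch only the first row, then stop
  close c;
end;
/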

Stew later followed up with “Forgot the UNPIVOT!”

select * from country_test,
lateral(
  select count(distinct val) cnt from (
    select c1,c2,c3,c4 from dual
  ) unpivot(val for col in (c1,c2,c3,c4))
);

This approach utilized one new Oracle Database 12c feature—the LATERAL clause—and one new Oracle Database 11g feature—UNPIVOT—to turn columns into rows.
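
For reference, here is UNPIVOT on its own, without the LATERAL clause: a minimal sketch that turns each COUNTRY_TEST row into four rows.

select rid, col, val
  from (select rowid rid, c1, c2, c3, c4 from country_test)
 unpivot (val for col in (c1, c2, c3, c4));
-- VAL holds the column's value; COL names the column it came from.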


Looking for Sets in Sets

I have this table:

create table test
(
  id  int ,
  x   int
);
insert into test values (1,1);
insert into test values (1,2);
insert into test values (1,3);
insert into test values (1,4);
insert into test values (2,1);
insert into test values (2,2);
insert into test values (2,3);
insert into test values (3,1);
insert into test values (3,2);

I want to get IDs that have exactly x in (1,2). Currently, the query should return ID = 3 only. Here’s a query that returns the current result requirement:

select distinct id
  from test
 where id in (select id from test where x = 1)
   and id in (select id from test where x = 2)
   and id not in (select id from test where x not in (1,2));

But I also want to write a generic query in case the requirement changes.

I can achieve this in a single query that utilizes bind variables—so it doesn’t have to hard-parse each unique combination—by using a simple outer join and some aggregation. First, I want to be able to generate the target data set as a true result set—not a comma-delimited list, but rather a set of rows. I can do that with a small subquery, as shown in Listing 2. By simply changing the bind variable value to another delimited string, I would generate an entirely different set, using exactly the same SQL query—and doing no hard parsing. (For more information on how this magic works—and other uses of this technique—see bit.ly/atbvary.)

Code Listing 2: Generated result set in DATA table

SQL> variable txt varchar2(20)
SQL> exec :txt := '1,2,3'
PL/SQL procedure successfully completed.
SQL> with data
  2  as
  3  (
  4  select
  5    trim( substr (txt,
  6          instr (txt, ',', 1, level  ) + 1,
  7          instr (txt, ',', 1, level+1)
  8             - instr (txt, ',', 1, level) -1 ) )
  9      as token
 10     from (select ','||:txt||',' txt
 11             from dual)
 12   connect by level <=
 13      length(:txt)-length(replace(:txt,',',''))+1
 14   )
 15  select *
 16    from data;
TOKEN
————————————————————————————————————————————
1
2
3

Once I have that small set of data in the DATA table, I am ready to go. All I need to do is find all the IDs in TEST that have at least one X value in this set (those are the candidate IDs), and for all of those IDs, I make an outer join to the DATA table. By using an outer join, I make sure I get all the rows from the TEST table. Once I’ve done that, it is just a matter of counting to make sure that (1) all of the rows in TEST for a given ID are matched up with a DATA row (if not, there are rows in TEST that are not in DATA) and that (2) I’ve received as many rows from TEST for a given ID as there are in DATA. The finished query is in Listing 3.

Code Listing 3: Finished solution for “Looking for Sets in Sets”

SQL> with data
  2  as
  3  (
  4  select
  5    trim( substr (txt,
  6          instr (txt, ',', 1, level  ) + 1,
  7          instr (txt, ',', 1, level+1)
  8             - instr (txt, ',', 1, level) -1 ) )
  9      as token
 10     from (select ','||:txt||',' txt
 11             from dual)
 12   connect by level <=
 13      length(:txt)-length(replace(:txt,',',''))+1
 14   )
 15  select id
 16    from test, data
 17   where id in (select id from test where x in (select * from data) )
 18     and test.x = data.token(+)
 19    group by id
 20   having count(distinct data.token ) = (select count(*) from data)
 21      and count(data.token) = count(*)
 22  /

On line 17 in Listing 3, I retrieve all the rows from TEST such that the ID was in the set of IDs that had an X value in DATA. Once I have those, I make an outer join to DATA on line 18. Again, that ensures that every row I retrieved from TEST would be returned from the join, regardless of whether I had a match in DATA.

On line 19 in Listing 3, I group by the ID column so I can do some counting and verify that, for a given ID, every row has a match in DATA and that for every row in DATA, I have a match in TEST. Those are exactly the IDs I want. On line 20, I verify that the number of rows that have a non-null DATA.TOKEN value equals the number of rows in DATA (every row in DATA has a match in TEST for that ID). On line 21 in Listing 3, I verify that every row for that ID in TEST has a match in DATA by counting just the non-null occurrences of DATA.TOKEN and comparing that result with the COUNT of rows for that ID in TEST. Thinking about this problem in sets is the key to solving it effectively in SQL.
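
To satisfy the original requirement, I just bind the set '1,2' and rerun the query in Listing 3 unchanged; only ID 3 has exactly that set of X values:

SQL> exec :txt := '1,2'
PL/SQL procedure successfully completed.
SQL> <query from Listing 3>
        ID
——————————
         3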

I don’t have room to mention all the cool things you can do in SQL that I’ve addressed on Ask Tom recently, but for more information on SQL, I encourage you to go to oracle.com/sql and check out the ever-evolving content there.

Next Steps

ASK Tom
Tom Kyte answers your most difficult technology questions. Highlights from that forum appear in this column.

READ more Tom

DOWNLOAD Oracle Database 12c

LEARN more about Oracle Database 12c

FOLLOW Tom on Twitter

FOLLOW Oracle Database on Twitter and on Facebook
