Database, SQL and PL/SQL

On Top-n and Pagination Queries

Our technologist gets more results using ROW_NUMBER, RANK, and DENSE_RANK.

By Tom Kyte

January/February 2007

In a recent Ask Tom column (September/October 2006), I wrote about using ROWNUM to perform top- n queries, getting the first n number of rows from an ordered result set, and pagination queries, getting rows n through m from a result set.

In that column, I demonstrated how to use ROWNUM like this:

select *
  from
(your_query)
 where rownum <= 10;

in order to get the first 10 rows from "your query," and how to use ROWNUM like this:

select *
  from
( select rownum rnum, a.*
    from (your_query) a
   where rownum <= :M )
where rnum >= :N;

in order to get rows n through m from "your query."

As always, there is more than one way to do things in Oracle Database, and these top- n and pagination queries are no exception. In this column, I'd like to take a look at alternative ways to achieve this functionality by using analytic functions and introduce another variation on the top- n query: getting the top- n records by some set of attributes , such as getting the "set of rows representing the top three salaries by DEPTNO" from EMP.

Top- N Query

Another way to write a top- n query (which is similar to using the LIMIT clause in MySQL or using "set rowcount" in SQL Server to limit the number of records returned by a query) is to use the analytic function ROW_NUMBER. It behaves similarly to the ROWNUM pseudocolumn but is more flexible and has more capabilities. Specifically, I'll be able to use ROW_NUMBER to get the "top- n records by some set of attributes" in the next section (" Top- n Queries by Something ").

Let's look at what ROW_NUMBER can do. Here is an example query using ROW_NUMBER to assign an increasing number to each row in the EMP table after sorting by SAL DESC:

SQL> select ename, sal,
  2      row_number()
  3       over (order by sal desc) rn
  4  from emp
  5  order by sal desc
  6  /
ENAME    SAL    RN
-----   ----    --
 KING   5000     1
 FORD   3000     2
SCOTT   3000     3
JONES   2975     4
 .
 .
 .
 JAMES    950   13
 SMITH    800   14
14 rows selected.

So, ROW_NUMBER allows me to easily assign this increasing number after sorting, but at first glance, it doesn't seem to support retrieval of just the first n rows:

SQL> select ename, sal,
  2      row_number()
  3       over (order by sal desc) rn
  4  from emp
  5  where
  6   row_number()
  7    over (order by sal desc) <= 3
  8  order by sal desc
  9  /
 row_number()
 *
ERROR at line 6:
ORA-30483: window  functions are not allowed here

The problem here is that analytic functions are evaluated after the WHERE clause and therefore cannot be used in the WHERE clause. That is not really a problem, however, because I have inline views—I can apply a predicate to ROW_NUMBER after it is assigned. For example

SQL> select *
  2  from (
  3  select ename, sal,
  4  row_number()
  5   over (order by sal desc) rn
  6  from emp
  7  )
  8  where rn <= 3
  9  order by sal desc
 10  /
ENAME    SAL    RN
-----   ----    --
 KING   5000     1
SCOTT   3000     2
 FORD   3000     3
3 rows selected.

So, that demonstrates how to perform a top- n query by using ROW_NUMBER and also points out a general issue with top- n queries. If you look at that result, you see two rows with the value 3000. What if, in the EMP table, three people, instead of just two, had a salary of 3000? The result obtained by the above query would be ambiguous—I would get three records, but the records I retrieved would be somewhat random. For example

SQL> update emp
  2  set sal = 3000
  3  where ename = 'SMITH';
1 row updated.
SQL> select *
  2  from (
  3  select ename, sal,
  4      row_number()
  5       over (order by sal desc) rn
  6  from emp
  7  )
  8  where rn <= 3
  9  order by sal desc
 10  /
ENAME    SAL    RN
-----   ----    --
KING    5000     1
SMITH   3000     2
SCOTT   3000     3
3 rows selected.

As you can see, FORD disappeared from this result set. However, you might find that if you perform this update and query your data, FORD does not disappear but rather SCOTT or SMITH does. Which row would disappear would be quite arbitrary. That shows a problem with top- n queries in general: If you are ordering by something nonunique, the first set of n rows you get back may be different, given exactly the same input data. For example

SQL> create table t
  2  ( x number,
  3    y number
  4  );
Table created.
SQL> insert into t
  2  values ( 1, 100 );
SQL> insert into t
  2  values ( 1, 200 );
SQL> insert into t
  2  values ( 1, 300 );
SQL> insert into t
  2  values ( 1, 400 );
SQL> select *
  2  from
  3  (select x,y,
  4    row_number()
  5    over (order by x) rn
  6  from t)
  7  where rn <= 3;
         X     Y   RN
       ---   ---   --
         1   100    1
         1   200    2
         1   300    3
3 rows selected.

Now, here's what happens if I reload this table with the same data, just in a different order :

SQL> truncate table t;
Table truncated.
SQL> insert into t
  2  values ( 1, 400 );
SQL> insert into t
  2  values ( 1, 200 );
SQL> insert into t
  2  values ( 1, 300 );
SQL> insert into t
  2  values ( 1, 100 );
SQL> select *
  2  from
  3  (select x,y,
  4    row_number()
  5    over (order by x) rn
  6  from t)
  7  where rn <= 3;
        X     Y    RN
       ---   ---   --
         1   400    1
         1   200    2
         1   300    3
3 rows selected.

All of a sudden, I get an entirely different answer. This is something to be aware of when you perform top- n queries and the attribute you order by is not unique.

Bearing this in mind, I can use other analytic functions to remove the ambiguity. They will do so, but the analytic functions might return more than n rows. In my opinion, when the attribute I order by is not unique, I want my query to return all of the relevant records—not just the first n arbitrary ones. To that end, I can use the RANK and DENSE_RANK analytic functions. Let's take a look at what they do:

SQL> select ename,sal,
  2   row_number()
  3     over (order by sal desc)rn,
  4   rank()
  5     over (order by sal desc)rnk,
  6   dense_rank()
  7     over (order by sal desc)drnk
  8   from emp
  9  order by sal desc
 10  /
ENAME    SAL   RN   RNK   DRNK
-----   ----   --   ---   ----
 KING   5000    1     1      1
 FORD   3000    2     2      2
SCOTT   3000    3     2      2
JONES   2975    4     4      3
BLAKE   2850    5     5      4
CLARK   2450    6     6      5
 .
 .
 .
14 rows selected.

The main things to note here are the following:

ROW_NUMBER assigns contiguous, unique numbers from 1.. N to a result set.
RANK does not assign unique numbers—FORD and SCOTT tied for second place—nor does it assign contiguous numbers. No record was assigned the value of 3, because two people tied for second place, and no one came in third, according to RANK.
DENSE_RANK, like RANK, does not assign unique numbers, but it does assign contiguous numbers. Even though two records tied for second place, there is a third-place record.

You can use RANK and DENSE_RANK in the same way you would use ROW_NUMBER to restrict the number of rows returned, but obviously you'll get subtly different results. For example

SQL> select *
  2  from (
  3  select ename,sal,
  4   dense_rank()
  5     over (order by sal desc)drnk
  6   from emp
  7  ) where drnk <= 3
  8  order by sal desc
  9  /
ENAME    SAL   DRNK
-----   ----   ----
 KING   5000      1
SCOTT   3000      2
 FORD   3000      2
JONES   2975      3
4 rows selected.

That query returns "the set of people who make the top three salaries," which is likely the desired result. Getting the first three records from EMP after sorting by SAL is rather arbitrary, because using exactly the same set of data, simply inserted in different orders, you could observe different result sets with ROW_NUMBER (because SAL is not unique). Using DENSE_RANK, however, I don't get precisely three records but, instead, a repeatable (deterministic) result set. And I suspect that I retrieve the set the end user really meant to retrieve—the set of people making the top three salaries.

Top- N Queries by SOMETHING

Using analytics is superior to using the ROWNUM pseudocolumn when it comes to processing more-complex queries. For example, suppose you wanted the top- n highest-paid people from the EMP table by DEPTNO . That is, for each department, display the top three highest-paid people (or the set of people making the top three salaries, using DENSE_RANK). To achieve this, I need to break up the result set by DEPTNO, sort it by SAL DESC, and then assign the analytic function result to each row. Fortunately, analytics are designed to do exactly this type of work. For example

SQL> break on deptno skip 1
SQL> select deptno, ename,sal,
  2   row_number()
  3     over (partition by deptno
  4           order by sal desc)rn,
  5   rank()
  6     over (partition by deptno
  7           order by sal desc)rnk,
  8   dense_rank()
  9     over (partition by deptno
 10           order by sal desc)drnk
 11   from emp
 12  order by deptno, sal desc
 13  /
  DEPTNO   ENAME    SAL   RN  RNK  DRNK
--------   -----   ----   --  ---  ----
      10    KING   5000    1    1     1
           CLARK   2450    2    2     2
          MILLER   1300    3    3     3
      20   SCOTT   3000    1    1     1
            FORD   3000    2    1     1
           JONES   2975    3    3     2
           ADAMS   1100    4    4     3
           SMITH    800    5    5     4
      30   BLAKE   2850    1    1     1
           ALLEN   1600    2    2     2
          TURNER   1500    3    3     3
          MARTIN   1250    4    4     4
            WARD   1250    5    4     4
           JAMES    950    6    6     5
14 rows selected.

Here I've used the PARTITION clause of the analytic function. This breaks up my result set virtually into some number of groups and then applies the analytic function to each group in turn, resetting itself as it goes from group to group.

That allows me to assign an increasing number—using ROW_NUMBER, for example—to the rows in each department, resetting the ROW_NUMBER to 1 as I go from group to group. As you can see, it would be quite easy to get the "top 3 by DEPTNO" now, using an inline view and applying the predicate WHERE RN <= 3 or RNK <= 3 or DRNK <= 3, depending on which numbering method returns the answer you deem most applicable.

So, for example, to retrieve the set of employees who make the top three salaries in each of the departments, I would code the following:

SQL> break on deptno skip 1
SQL> select *
  2  from (
  3  select deptno, ename, sal,
  4   dense_rank()
  5     over (partition by deptno
  6           order by sal desc)drnk
  7   from emp
  8  ) where drnk <= 3
  9  order by deptno, sal desc
 10  /
     DEPTNO   ENAME    SAL   DRNK
     ------   -----   ----   ----
         10    KING   5000      1
              CLARK   2450      2
             MILLER   1300      3
         20   SCOTT   3000      1
               FORD   3000      1
              JONES   2975      2
              ADAMS   1100      3
         30   BLAKE   2850      1
              ALLEN   1600      2
             TURNER   1500      3
10 rows selected.

Pagination in Getting Rows N Through M

The last thing I'll take a look at is performing pagination with analytics. Pagination is common in Web-based applications: The user inputs a set of criteria, which runs a query and then lets the user click the Previous and Next buttons to page through the result set. To achieve this paging functionality, the application needs to be able to get a certain set of rows from the query. In the prior column dealing with ROWNUM and in the "Top- n Query" section above, I demonstrated how to achieve that pagination, but it required two levels of inline views and could be considered rather unwieldy. A syntactically easier way to write that pagination query is

select *
 from (
select /*+ first_rows(25) */
  your_columns,
  row_number()
  over (order by something unique)rn
 from your_tables )
where rn between :n and :m
order by rn;

For example, suppose you wanted to paginate through the ALL_OBJECTS view, wanted to show 25 rows on a page, and needed to show page 5. The query would be

SQL> variable n number
SQL> variable m number
SQL> exec :n := 101; :m := 125
PL/SQL procedure successfully completed.
SQL> select *
  2  from (
  3  select /*+ first_rows(25) */
  4   object_id,object_name,
  5   row_number() over
  6     (order by object_id) rn
  7  from all_objects)
  8  where rn between :n and :m
  9  order by rn;
  OBJECT_ID    OBJECT_NAME    RN
  ---------    -----------   ---
        102         I_SEQ1   101
        103     I_OBJAUTH1   102
        .
        .
        .
        124      I_ACCESS1   123
        125      I_TRIGGER1  124
        126      I_TRIGGER2  125
25 rows selected.

I chose 25 for the FIRST_ROWS hint because that is my page size and pagination-style queries should be optimized to get the first page as fast as possible. I assign ROW_NUMBER to each row after sorting by OBJECT_ID. Then I use a simple BETWEEN to retrieve the specific rows I need (101 through 125 in this case).

You need to order by something unique for these pagination queries, so that ROW_NUMBER is assigned deterministically to the rows each and every time. Otherwise, you end up with a problem similar to what I pointed out with the salary top- n query. For more details on deterministic queries, go to oracle.com/technetwork/issue-archive/2006/06-sep/o56asktom-086197.html. Also, for complete information on using analytic functions, see download.oracle.com/docs/cd/B19306_01/server.102/b14223/analysis.htm#i1007779 (chapter 21 of the Data Warehousing Guide ). In my opinion, analytic functions are the coolest thing to happen to SQL since the SELECT keyword was introduced.

Does GROUP BY Sort?

Does a GROUP BY clause in a query guarantee that the output data will be sorted on the GROUP BY columns in order, even if there is no ORDER BY clause?

Unless and until there is an ORDER BY statement on a query, the rows returned cannot be assumed to be in any order. Without an ORDER BY, the data may be returned in any order in which the database feels like returning it. This has always been true and will always be true.

In fact, in Oracle Database 10g Release 2, you'll see GROUP BY returning data in a random order much more often than before:

SQL> set autotrace on explain
SQL> select job, count(*)
  2    from emp
  3   group by job
  4  /
JOB          COUNT(*)
---------    --------
CLERK               4
SALESMAN            4
PRESIDENT           1
MANAGER             3
ANALYST             2
-----------------------------------
|  Id  | Operation         | Name |
-----------------------------------
|   0  | SELECT STATEMENT  |      |
|   1  | HASH GROUP BY     |      |
|   2  | TABLE ACCESS FULL | EMP  |
-----------------------------------

The HASH GROUP BY step in the explain plan is a newer and generally faster and more efficient query plan step for a GROUP BY operation, and it definitely doesn't sort data.

Don't be misled if the results are not the same on your system. It does not matter if, when you execute that query, the results are sorted on your system—it takes only one countercase to demonstrate that something is not always true. Only with an ORDER BY can you make any assumptions about the sorted order of data. Many things—from parallel query to partitioning, to reverse-key indexes, to hash-partitioned indexes, to HASH GROUP BY steps, to the presence (or lack) of B*tree indexes—can and will cause data not to be returned in the sorted order you anticipated.

If you need data sorted, you have to use an ORDER BY statement; there is no avoiding that.

Next Steps

ASK Tom
Oracle Vice President Tom Kyte answers your most difficult technology questions. Highlights from that forum appear in this column.
asktom.oracle.com

DOWNLOAD Oracle Database 10g Express Edition (Oracle Database XE)

Share

On Top-n and Pagination Queries

Our technologist gets more results using ROW_NUMBER, RANK, and DENSE_RANK.