Skip to Main Content
  • Questions
  • Fast Query To Get TOP-N Distinct Rows From A Huge Data Table

Breadcrumb

Dev Live Dev Intro

This month we are celebrating Developers at AskTOM. We welcome Developers of all levels of experience to join us at our FREE Developer Live events coming in August. Just click on the left to register today! If you are brand new to Database Technology, then we also have got you covered. Just click on the right for your comprehensive FREE training program to kick start your Oracle Database Development journey!

Question and Answer

Tom Kyte

Thanks for the question, Saed.

Asked: June 04, 2013 - 2:47 pm UTC

Answered by: Tom Kyte - Last updated: June 06, 2013 - 6:10 pm UTC

Category: Database - Version: 11.2.0.2.0

Viewed 10K+ times! This question is

You Asked

Hello Tom,

I am using the query below [1] to search the table and get the first unique N result within milliseconds, but the execution will execute the DISTINCT after getting the first N result which will reduce the rows to be less than N rows which means incomplete result. To solve this I separated the DISTINCT and ROWNUM into two nested select statements as shown below [2] but the performance has been degraded especially when it used for huge data tables.

Could you please advice or provide me with a better solution that has a good performance (consider that the query might have joins statements) Specifically waht I need is just like the following MS SQL query [SELECT DISTINCT TOP N FROM TABLE WHERE COLUMN like '%A%']

CREATE TABLE PERSON_INFO (FIRST_NAME varchar2(15), LAST_NAME varchar2(15));

insert into person_info values ('ABC','1');
insert into person_info values ('DBC','2');
insert into person_info values ('ABC','3');
insert into person_info values ('ABC','4');
insert into person_info values ('ADE','5');
insert into person_info values ('ADE','6');
insert into person_info values ('ADE','7');
insert into person_info values ('ADE','8');
INSERT INTO PERSON_INFO VALUES ('ADE','9');


-- [1] Fast but Incomplete result (The expected result was [ABC, ADE]) (Actual result is [ABC])
select distinct FIRST_NAME from person_info where FIRST_NAME like '%A%' and rownum <=3

-- [2] Slow but Complete result
select * from (select distinct FIRST_NAME from person_info where FIRST_NAME like '%A%') where rownum <=3


Thanks.

and we said...

MS SQL wouldn't be any faster with SELECT DISTINCT TOP N FROM TABLE WHERE COLUMN like '%A%', that is semantically equivalent to:

select * from (select distinct FIRST_NAME from person_info where FIRST_NAME
like '%A%') where rownum <=3

we both have to get the distinct values and then return 3 rows from it.


this query is hugely suspect to me - you do understand that two people running the same query against the same exact set of data can get different answers - and both would be correct - right? You understand that? your query is not deterministic. A different sized workarea or a different plan could well return different (but correct) answers.


You can approach it this way, it depends on the presence of an index. The index will make the query itself faster (much less data to scan), and the index can be used to return the first 3 unique hits quickly. For example:

ops$tkyte%ORA11GR2> create table t
  2  as
  3  select * from all_objects;

Table created.

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> create index t_idx on t(owner);

Index created.

ops$tkyte%ORA11GR2> exec dbms_stats.gather_table_stats( user, 'T' );

PL/SQL procedure successfully completed.

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> set autotrace on
ops$tkyte%ORA11GR2> select *
  2    from (
  3  select owner, count(*)
  4    from t
  5   where owner like '%A%'
  6   group by owner
  7         )
  8   where rownum <= 3
  9  /

OWNER                            COUNT(*)
------------------------------ ----------
A                                       2
APEX_030200                          2251
APPQOSSYS                               5


Execution Plan
----------------------------------------------------------
Plan hash value: 112435869

--------------------------------------------------------------------------------
| Id  | Operation              | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |     3 |    90 |     2   (0)| 00:00:01 |
|*  1 |  COUNT STOPKEY         |       |       |       |            |          |
|   2 |   VIEW                 |       |     4 |   120 |     2   (0)| 00:00:01 |
|   3 |    SORT GROUP BY NOSORT|       |     4 |    28 |     2   (0)| 00:00:01 |
|*  4 |     INDEX FULL SCAN    | T_IDX |  3849 | 26943 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=3)
   4 - filter("OWNER" LIKE '%A%')


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
         12  consistent gets
          0  physical reads
          0  redo size
        552  bytes sent via SQL*Net to client
        420  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          3  rows processed

<b>Notice that there are 12 IO's against the index to get the first three unique hits (group by would do that for you - the unique bit)..</b>

ops$tkyte%ORA11GR2> select *
  2    from (
  3  select owner, count(*)
  4    from t
  5   where owner like '%A%'
  6   group by owner
  7         )
  8   where rownum <= 5
  9  /

OWNER                            COUNT(*)
------------------------------ ----------
A                                       2
APEX_030200                          2251
APPQOSSYS                               5
MAIN_DB                                 2
OLAPSYS                               719


Execution Plan
----------------------------------------------------------
Plan hash value: 112435869

--------------------------------------------------------------------------------
| Id  | Operation              | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |     5 |   150 |     2   (0)| 00:00:01 |
|*  1 |  COUNT STOPKEY         |       |       |       |            |          |
|   2 |   VIEW                 |       |     6 |   180 |     2   (0)| 00:00:01 |
|   3 |    SORT GROUP BY NOSORT|       |     6 |    42 |     2   (0)| 00:00:01 |
|*  4 |     INDEX FULL SCAN    | T_IDX |  3849 | 26943 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=5)
   4 - filter("OWNER" LIKE '%A%')


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
         30  consistent gets
          0  physical reads
          0  redo size
        585  bytes sent via SQL*Net to client
        420  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          5  rows processed

<b>Now, to get the first 5, we had to process 30 blocks... My owner column has a ton of repeats...</b>

ops$tkyte%ORA11GR2> select *
  2    from (
  3  select owner, count(*)
  4    from t
  5   where owner like '%A%'
  6   group by owner
  7         )
  8   where rownum <= 30
  9  /

OWNER                            COUNT(*)
------------------------------ ----------
A                                       2
APEX_030200                          2251
APPQOSSYS                               5
MAIN_DB                                 2
OLAPSYS                               719
ORACLE_OCM                              8
ORDDATA                               239
OWBSYS_AUDIT                           12
PERFSTAT                              148
SI_INFORMTN_SCHEMA                      8
SYSMAN                               3392

11 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 112435869

--------------------------------------------------------------------------------
| Id  | Operation              | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |    30 |   900 |     3   (0)| 00:00:01 |
|*  1 |  COUNT STOPKEY         |       |       |       |            |          |
|   2 |   VIEW                 |       |    31 |   930 |     3   (0)| 00:00:01 |
|   3 |    SORT GROUP BY NOSORT|       |    31 |   217 |     3   (0)| 00:00:01 |
|*  4 |     INDEX FULL SCAN    | T_IDX |  3849 | 26943 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=30)
   4 - filter("OWNER" LIKE '%A%')


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        186  consistent gets
          0  physical reads
          0  redo size
        703  bytes sent via SQL*Net to client
        420  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         11  rows processed

<b>to get 30 - we had to process yet more blocks and so on. </b>
ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> set autotrace off



This shows that - the less data you need, the less work we'll do. The more you need - the more we'll do.

Depending on the uniqueness of the data - this could have a massive impact on your query speed. In the above, my data was not very unique at all - but if we make it almost unique:

ops$tkyte%ORA11GR2> create table t
  2  as
  3  select * from all_objects;

Table created.

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> create index t_idx on t(object_name);

Index created.

ops$tkyte%ORA11GR2> exec dbms_stats.gather_table_stats( user, 'T' );

PL/SQL procedure successfully completed.

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> set autotrace on
ops$tkyte%ORA11GR2> select *
  2    from (
  3  select object_name, count(*)
  4    from t
  5   where object_name like '%A%'
  6   group by object_name
  7         )
  8   where rownum <= 3
  9  /

OBJECT_NAME                      COUNT(*)
------------------------------ ----------
/10128284_OpenMBeanAttributeIn          2
/10297c91_SAXAttrList                   2
/10378bc5_IA64BaseLIRInstrFSAB          2


Execution Plan
----------------------------------------------------------
Plan hash value: 112435869

--------------------------------------------------------------------------------
| Id  | Operation              | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |     3 |    90 |     3   (0)| 00:00:01 |
|*  1 |  COUNT STOPKEY         |       |       |       |            |          |
|   2 |   VIEW                 |       |     4 |   120 |     3   (0)| 00:00:01 |
|   3 |    SORT GROUP BY NOSORT|       |     4 |   100 |     3   (0)| 00:00:01 |
|*  4 |     INDEX FULL SCAN    | T_IDX |  3849 | 96225 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=3)
   4 - filter("OBJECT_NAME" LIKE '%A%')


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
          4  consistent gets
          0  physical reads
          0  redo size
        614  bytes sent via SQL*Net to client
        420  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          3  rows processed

ops$tkyte%ORA11GR2> select *
  2    from (
  3  select object_name, count(*)
  4    from t
  5   where object_name like '%A%'
  6   group by object_name
  7         )
  8   where rownum <= 5
  9  /

OBJECT_NAME                      COUNT(*)
------------------------------ ----------
/10128284_OpenMBeanAttributeIn          2
/10297c91_SAXAttrList                   2
/10378bc5_IA64BaseLIRInstrFSAB          2
/104f94f3_AiffFileWriter                2
/105f39a1_AQDequeueOptions              2


Execution Plan
----------------------------------------------------------
Plan hash value: 112435869

--------------------------------------------------------------------------------
| Id  | Operation              | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |     5 |   150 |     3   (0)| 00:00:01 |
|*  1 |  COUNT STOPKEY         |       |       |       |            |          |
|   2 |   VIEW                 |       |     6 |   180 |     3   (0)| 00:00:01 |
|   3 |    SORT GROUP BY NOSORT|       |     6 |   150 |     3   (0)| 00:00:01 |
|*  4 |     INDEX FULL SCAN    | T_IDX |  3849 | 96225 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=5)
   4 - filter("OBJECT_NAME" LIKE '%A%')


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
          4  consistent gets
          0  physical reads
          0  redo size
        676  bytes sent via SQL*Net to client
        420  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          5  rows processed

ops$tkyte%ORA11GR2> select *
  2    from (
  3  select object_name, count(*)
  4    from t
  5   where object_name like '%A%'
  6   group by object_name
  7         )
  8   where rownum <= 30
  9  /

OBJECT_NAME                      COUNT(*)
------------------------------ ----------
/10128284_OpenMBeanAttributeIn          2
/10297c91_SAXAttrList                   2
/10378bc5_IA64BaseLIRInstrFSAB          2
/104f94f3_AiffFileWriter                2
/105f39a1_AQDequeueOptions              2
/1065f59e_DescriptorAccess              2
/106ba0a5_ArrayEnumeration              2
/108b4b6_AnyImpl                        2
/109dbb46_MetalLookAndFeelFont          2
/109def6d_AMD64AbstractMIR2LIR          2
/10dbe49_AdapterInactiveHelper          2
/10eb5fef_AnyNodeCounter                2
/11195b83_OrdAVIVideoEncoder1           2
/111f1d0f_AlternateIIOPAddress          2
/113a521c_AppInputStream                2
/1169ca20_AdapterActivator              2
/117c8d34_ArrayBlockingQueueIt          2
/11ab5385_JTextComponentAccess          2
/11dee1d7_AQjmsXAResourceFacto          2
/11e6290d_AtomicLongFieldUpdat          2
/11eac1b7_DGCAckHandler1                2
/11fe6909_ArrayInstruction              2
/1225be9b_MonitoredAttributeBa          2
/1228e36d_AMD64AbstractMIR2LIR          2
/122bb5b3_AQjmsExceptionListen          2
/122e6c15_OracleSqljXADataSour          2
/126fa54b_DicomUidDefS_CLASSIF          2
/128fe27c_NameAndTypeConstantD          2
/1297df4c_ToHTMLSAXHandler              2
/129960c0_ArrayNotificationBuf          2

30 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 112435869

--------------------------------------------------------------------------------
| Id  | Operation              | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |    30 |   900 |     6   (0)| 00:00:01 |
|*  1 |  COUNT STOPKEY         |       |       |       |            |          |
|   2 |   VIEW                 |       |    31 |   930 |     6   (0)| 00:00:01 |
|   3 |    SORT GROUP BY NOSORT|       |    31 |   775 |     6   (0)| 00:00:01 |
|*  4 |     INDEX FULL SCAN    | T_IDX |  3849 | 96225 |     6   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=30)
   4 - filter("OBJECT_NAME" LIKE '%A%')


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
          7  consistent gets
          0  physical reads
          0  redo size
       1650  bytes sent via SQL*Net to client
        431  bytes received via SQL*Net from client
          3  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         30  rows processed

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> set autotrace off



you can see we can get a relative large result set back - with very very very little processing.

and you rated our response

  (8 ratings)

Is this answer out of date? If it is, please let us know via a Review

Reviews

I beg to disagree

June 04, 2013 - 3:42 pm UTC

Reviewer: Sokrates

the query is not deterministic, right, but every top-n-query with no order by is not deterministic.

I don't agree with:

MS SQL wouldn't be any faster with SELECT DISTINCT TOP N FROM TABLE WHERE COLUMN like '%A%', that is semantically equivalent to:

select * from (select distinct FIRST_NAME from person_info where FIRST_NAME
like '%A%') where rownum <=3

we both have to get the distinct values and then return 3 rows from it.


no, you don't have to get all the distinct values.

see
CREATE TABLE PERSON_INFO (FIRST_NAME varchar2(15), LAST_NAME varchar2(15));

insert into person_info values ('ABC','1');
insert into person_info values ('DBC','2');
insert into person_info values ('ABC','3');
insert into person_info values ('ABC','4');
insert into person_info values ('ADE','5');
insert into person_info values ('ADE','6');
insert into person_info values ('ADE','7');
insert into person_info values ('ADE','8');
INSERT INTO PERSON_INFO VALUES ('ADE','9');

create type stringtable is table of varchar2(15);
/

create or replace function distinct_top_n( pattern in varchar2, top_n in int ) return stringtable pipelined is
type already_t is table of char(1) index by varchar2(15);
already_piped already_t;
co int := 0;
begin
   for c in (
      select *
      from PERSON_INFO
      where first_name like pattern
   ) loop
      begin
         if already_piped(c.first_name) = 'x' then null; end if;
      exception when no_data_found then 
         pipe row(c.first_name); already_piped(c.first_name) := 'x'; co := co + 1;
      end;
      
      exit when co >= top_n;
   end loop;
end distinct_top_n;
/

select * from table(distinct_top_n(pattern => '%A%', top_n => 3));

COLUMN_VALUE
---------------
ABC
ADE


the table doesn't have to be read fully.
This doesn't mean I would advise such a solution, but shows that there exists algorithms for for DISTINCT TOP N queries which don't have to get all distinct values.
Probably MS SQL has implemented such an algorithm
Tom Kyte

Followup  

June 04, 2013 - 4:24 pm UTC

... the query is not deterministic, right, but every top-n-query with no order by is not deterministic.
...


right - and even with an order by it might not be deterministic. It is what I was pointing out. It makes me suspicious, suspicious that the query is wrong in the first place, ill specified.



I demonstrated a way to query the data that achieves what your procedure code does, without any procedural code, in Oracle.

@sokrates

June 04, 2013 - 4:42 pm UTC

Reviewer: Rajeshwaran

the table doesn't have to be read fully - But I see the table "PERSON_INFO" is read fully. Oracle reads all rows matching predicates. Here is what I see from Tkprof.

SELECT * 
FROM
 PERSON_INFO WHERE FIRST_NAME LIKE :B1 


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          1          0           0
Fetch        1      0.00       0.00          0          7          0           8
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      0.00       0.00          0          8          0           8

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 84     (recursive depth: 1)

Rows     Row Source Operation
-------  ---------------------------------------------------
      8  TABLE ACCESS FULL PERSON_INFO (cr=7 pr=0 pw=0 time=0 us cost=3 size=144 card=8)



Tom Kyte

Followup  

June 04, 2013 - 7:54 pm UTC

he is saying that to get three random distinct values, the table doesn't have to be read fully (conceptually it would). I demonstrated that with the group by.

How about With clause or Mat. View

June 04, 2013 - 11:04 pm UTC

Reviewer: A reader

Try this

with dis as (select distinct first_name from person_info)
select first_name from dis where FIRST_NAME like '%A%' and rownum <=3

or

If realtime data no needed

Create a Mview of distinct records and query that
Tom Kyte

Followup  

June 05, 2013 - 12:02 am UTC

well, your with query won't be any better - the group by I demonstrated above can get you the answer without having to get the ENTIRE answer. Your with query will full scan every row - the group by won't.

The materialized view of distinct records would technically work - but - again the group by does it without requiring the maintenance of a materialized view at all.

the group by won't full scan every row

June 05, 2013 - 9:41 am UTC

Reviewer: Sokrates


I demonstrated a way to query the data that achieves what your procedure code does, without any procedural code, in Oracle.


Very nice demonstration !

I thought
select *
from (
select owner
from t
where owner like '%A%'
group by owner
)
where rownum<=3


would be the same as
select *
from (
select distinct owner
from t
where owner like '%A%'
)
where rownum<=3


but apparently they are handled differently by the optimizer.

Tom Kyte

Followup  

June 05, 2013 - 2:10 pm UTC

yeah, the sort group by nosort is rather neat, it allows a group by to return immediately in the presence of an index on the group by elements.

June 05, 2013 - 2:45 pm UTC

Reviewer: A reader

Hi Tom,


<quote from your above comment>

select * from (select distinct FIRST_NAME from person_info where FIRST_NAME
like '%A%') where rownum <=3

we both have to get the distinct values and then return 3 rows from it.


this query is hugely suspect to me - you do understand that two people running the same query against the same exact set of data can get different answers - and both would be correct - right? You understand that? your query is not deterministic. A different sized workarea or a different plan could well return different (but correct) answers.

<quote from your above comment>

Question

(1) is it optimizer push rownum predicate in above query and thats the reason of your above comment?

(2) Also how distinct and group by handle differenly by optimizer and why optimizer don't apply same for distinct version of query?



select *
from (
select owner
from t
where owner like '%A%'
group by owner
)
where rownum<=3




select *
from (
select distinct owner
from t
where owner like '%A%'
)
where rownum<=3


Thanks in Advance

Tom Kyte

Followup  

June 05, 2013 - 4:06 pm UTC

(1) no, that isn't the reason, the rownum is not, will not, be pushed into the inner query.


that query says "get a distinct list of values, return the first three"


maybe when you generate the distinct list of values, it generates "a,d,c,b,e,f,z" and maybe when it does it for me, it generates "z,f,b,e,c,a,d". Your first three are a,d,c, mine are z,f,b


distinct doesn't have to sort, the list of returned items doesn't have to be in the same order every time


2) semantically, they are the same, but the group by currently can be better optimized. for example:

ops$tkyte%ORA11GR2> drop table t;
ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> create table t as
ops$tkyte%ORA11GR2> select rownum x, a.* from all_objects a;
ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> create index t_idx on t(x);
ops$tkyte%ORA11GR2> exec dbms_stats.gather_table_stats( user, 'T' );



tkprof will show:

select * from (select distinct x from t where x > 1) where rownum <= 3

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      0.06       0.06          0        169          0           3
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4      0.06       0.06          0        169          0           3


Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         3          3          3  COUNT STOPKEY (cr=169 pr=0 pw=0 time=63797 us)
         3          3          3   VIEW  (cr=169 pr=0 pw=0 time=63776 us cost=271 
         3          3          3    SORT GROUP BY STOPKEY (cr=169 pr=0 pw=0 time=6
     72940      72940      72940     INDEX FAST FULL SCAN T_IDX (cr=169 pr=0 pw=0 
********************************************************************************
select * from (select x from t where x > 1 group by x) where rownum <= 3

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      0.00       0.00          0          3          0           3
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4      0.00       0.00          0          3          0           3

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         3          3          3  COUNT STOPKEY (cr=3 pr=0 pw=0 time=68 us)
         3          3          3   VIEW  (cr=3 pr=0 pw=0 time=58 us cost=2 size=39 
         3          3          3    SORT GROUP BY NOSORT (cr=3 pr=0 pw=0 time=53 u
         4          4          4     INDEX RANGE SCAN T_IDX (cr=3 pr=0 pw=0 time=44 



the group by stopped reading the index because the sort group by nosort step can start feeding the count stopkey step right away, whereas the distinct chose a sort group by stopkey - which processed all of the rows and then returned the first three it had. Some day in the future, the optimizer might look at those two queries and say 'they are the same', currently it does not.

June 06, 2013 - 3:57 pm UTC

Reviewer: A reader

Thanks for great expalanation!!! this is really helpfull

just one last point to clarify:

since distinct query in your example goes for "sort group by" is it safe to say the result will be deterministic as oppose to "hash group by" which does not follow any ordering, right?


Thank you,




Tom Kyte

Followup  

June 06, 2013 - 4:22 pm UTC

the result is not deterministic. a simple change in plan can and will change the result.

suppose you were distincting on two columns - A, B. You where on A. You group by A,B. At some point there is an index on A,B - but it is decided that B,A would be better - that'll change the order of the rows returned.


ops$tkyte%ORA11GR2> create table t ( a varchar2(30), b varchar2(30), data varchar2(100) );

Table created.

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> insert into t
  2  select object_name, ename, rpad( 'x', 100, 'x' )
  3    from (select object_name, rownum x from all_objects where rownum <= 14),
  4         (select ename, rownum y from scott.emp)
  5    where x = y
  6  /

14 rows created.

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> create index t_idx on t(a,b);

Index created.

ops$tkyte%ORA11GR2> select *
  2    from (
  3  select a, b
  4    from t
  5   where a like '%$%'
  6   group by a, b
  7         )
  8   where rownum <= 3;

A                              B
------------------------------ ------------------------------
CON$                           WARD
FILE$                          JAMES
ICOL$                          SMITH

ops$tkyte%ORA11GR2> 
ops$tkyte%ORA11GR2> drop index t_idx;

Index dropped.

ops$tkyte%ORA11GR2> create index t_idx on t(b,a)
  2  /

Index created.

ops$tkyte%ORA11GR2> select *
  2    from (
  3  select a, b
  4    from t
  5   where a like '%$%'
  6   group by a, b
  7         )
  8   where rownum <= 3;

A                              B
------------------------------ ------------------------------
I_PROXY_ROLE_DATA$_1           ADAMS
PROXY_ROLE_DATA$               CLARK
UET$                           FORD




June 06, 2013 - 5:22 pm UTC

Reviewer: A reader

Thanks a lot!! so to make it deterministec we need to give "order by" before getting top N :)


Tom Kyte

Followup  

June 06, 2013 - 6:10 pm UTC

yes, and the plan should not change - even with the order by. But the order by would assure you that the result of this particular query would be deterministic given a static set of data, it would not be subject to different results due to different plans/structures.

dirty query, undeterministic performance

June 13, 2013 - 3:15 pm UTC

Reviewer: pacmann from france

Just for fun, no index, no full scan (if you are lucky), no pl :

SELECT case level when 1 then afn when 2 then bfn when 3 then cfn end first_name
select *
from (
select a.FIRST_NAME afn, b.FIRST_NAME bfn, c.FIRST_NAME cfn
from person_info a
cross join person_info b
cross join person_info c
where a.FIRST_NAME like '%A%'
and b.FIRST_NAME like '%A%'
and c.FIRST_NAME like '%A%'
and a.first_name <> b.first_name
and a.first_name <> c.first_name
and b.first_name <> c.first_name
)
rownum <=1
)
CONNECT BY LEVEL <= 3

If you are not lucky (ie your data is not very distinct), it will never end, especially if you need more distinct values :)

More to Explore

DBMS_STATS

More on PL/SQL routine DBMS_STATS here