Thanks for the question, Pankaj.

Asked: January 30, 2005 - 10:39 pm UTC

Last updated: May 30, 2013 - 2:50 pm UTC

Version: 8.1.7

Viewed 100K+ times! This question is

You Asked

Tom,

Just want to know the difference between DISTINCT and GROUP BY in queries where I'm not using any aggregate functions.

Like for example.

Select emp_no, name from Emp
Group by emo_no, name

And

Select distinct emp_no, name from emp;

Which one is faster and why ?

Thanks

and Tom said...

they are for all intents and purposes the same... and really easy for you to evaluate!!

if you run:

set linesize 121
set echo on

drop table t;

create table t
as
select * from all_objects;

alter table t add constraint t_pk primary key(object_id);
exec dbms_stats.gather_table_stats( user, 'T', cascade=>true );

set autotrace traceonly
select distinct owner, object_name, object_type from t;
select owner, object_name, object_type from t group by owner, object_name, object_type;
set autotrace off
alter session set sql_trace=true;
set autotrace traceonly
select distinct owner, object_name, object_type from t;
select distinct owner, object_name, object_type from t;
select distinct owner, object_name, object_type from t;
select owner, object_name, object_type from t group by owner, object_name, object_type;
select owner, object_name, object_type from t group by owner, object_name, object_type;
select owner, object_name, object_type from t group by owner, object_name, object_type;
set autotrace off

select distinct owner, object_name, object_type from t

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 3 0.00 0.00 0 0 0 0
Execute 3 0.00 0.00 0 0 0 0
Fetch 9525 0.92 0.83 0 1992 0 142815
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 9531 0.92 0.83 0 1992 0 142815

Rows Row Source Operation
------- ---------------------------------------------------
47605 SORT UNIQUE (cr=664 pr=0 pw=0 time=177034 us)
47938 TABLE ACCESS FULL T (cr=664 pr=0 pw=0 time=48087 us)
********************************************************************************
select owner, object_name, object_type from t group by owner, object_name, object_type

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 3 0.00 0.00 0 0 0 0
Execute 3 0.00 0.00 0 0 0 0
Fetch 9525 0.93 0.85 0 1992 0 142815
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 9531 0.93 0.85 0 1992 0 142815

Rows Row Source Operation
------- ---------------------------------------------------
47605 SORT GROUP BY (cr=664 pr=0 pw=0 time=166792 us)
47938 TABLE ACCESS FULL T (cr=664 pr=0 pw=0 time=48006 us)

Rating

(45 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Thanks Tom

Pankaj, January 31, 2005 - 9:38 am UTC

As always, thanks tom.

DISTINCT issues

Mike Angelastro, December 19, 2005 - 2:33 pm UTC

Isn't using a "DISTINCT" sometimes a sign of a query that hasn't been fully thought out? Whenever I create a query, I run it with and without a "DISTINCT" and, if there is a difference in the record counts, I try to figure out why. Usually, if the record counts are different, there is something I hadn't considered.

December 19, 2005 - 4:04 pm UTC

it *could* mean that.

it doesn't have to mean that.

A reader, January 19, 2006 - 3:36 am UTC

Direct Answer

A reader, May 11, 2006 - 8:40 pm UTC

It is always nice to see an answer backed up with data rather than conjecture.

Distinct vs Group By

Victor, October 05, 2006 - 4:13 am UTC

Hi there. I disagree with the statement that they are the same.

I had a statement, built of two nested "group by" tables, and it hung... changed to "distinct" and voila! 1.5min execution.

I think distinct *pure thumbsuck* only takes one of each result value, where group by takes all results, and summarises the result into unique values. This is support by the added functionality group by has - HAVING!

ok then
Victor

October 05, 2006 - 8:14 am UTC

they are the same in that the results they return are ....... ta-dah - the same. They have the same effect.

Just like:

select * from t where x in ( select y from t2 )

is the "same" as:

select t.* from t, (select distinct y from t2) t2 where t.x = t2.y;

they may well perform differently (different plans) but they are "interchangable, the same"

DISTINCT de-mythified by Prometheus

Duke Ganote, October 05, 2006 - 9:55 am UTC

Oh, this takes me back-- one of the rule-of-thumb (ROT) myths I remember hearing from crusty DBAs when I started working with Oracle DBMS late last century:
* "always use GROUP BY instead of DISTINCT because it performs better."
* "Never use outer joins; make the application do that"
My first purchased book, "Oracle 8i: The Complete Reference", had a scant page on explain plan that illustrated just a few lines of cryptic-looking output...

Then early in this century I noticed a piece in Oracle magazine by a guy who knew how to get performance statistics from the DBMS, interpret the results, and who used the experimental method to test hypotheses. I was impressed. And he writes some great books!

Thanks (exponentially), Tom !

David Aldridge, October 05, 2006 - 5:03 pm UTC

I ran exactly the same test in 10.2 just to confirm that nothing about the HASH GROUP BY changed this, and noticed that the distinct query used HASH UNIQUE, which made me initially believe that both operations were still internally the same.

However the tkprof'd trace file showed the DISTINCT operation using more CPU than the GROUP BY ...

select distinct owner, object_name, object_type
from
t

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 3 0.00 0.00 0 0 0 0
Execute 3 0.00 0.00 0 0 0 0
Fetch 10095 1.85 1.78 45 2106 0 151374
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10101 1.85 1.79 45 2106 0 151374

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 62

Rows Row Source Operation
------- ---------------------------------------------------
50458 HASH UNIQUE (cr=702 pr=45 pw=45 time=2405972 us)
50750 TABLE ACCESS FULL T (cr=702 pr=0 pw=0 time=2334900 us)

********************************************************************************

********************************************************************************

select owner, object_name, object_type from t group by owner, object_name,
object_type

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 3 0.00 0.00 0 0 0 0
Execute 3 0.00 0.00 0 0 0 0
Fetch 10095 1.18 1.08 0 2106 0 151374
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10101 1.19 1.08 0 2106 0 151374

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 62

Rows Row Source Operation
------- ---------------------------------------------------
50458 HASH GROUP BY (cr=702 pr=0 pw=0 time=624258 us)
50750 TABLE ACCESS FULL T (cr=702 pr=0 pw=0 time=507831 us)

********************************************************************************

If you can't reproduce at your end Tom then I can send trace files and whatnot.

October 06, 2006 - 8:26 am UTC

I would say those numbers are close enough for me to call them "the same" over time. (look at your elapsed times.... for example)


ops$tkyte%ORA10GR2> create table t as select * from all_objects;

Table created.

ops$tkyte%ORA10GR2> insert /*+ APPEND */ into t select * from t;

50036 rows created.

ops$tkyte%ORA10GR2> commit;

Commit complete.

ops$tkyte%ORA10GR2> insert /*+ APPEND */ into t select * from t;

100072 rows created.

ops$tkyte%ORA10GR2> commit;

Commit complete.

ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> begin
  2          for i in 1 .. 1
  3          loop
  4                  for x in (select distinct owner, object_name, object_type from t)
  5                  loop
  6                          null;
  7                  end loop;
  8                  for x in (select owner, object_name, object_type from t group by owner, object_name, object_type)
  9                  loop
 10                          null;
 11                  end loop;
 12          end loop;
 13  end;
 14  /

PL/SQL procedure successfully completed.

ops$tkyte%ORA10GR2> @trace
ops$tkyte%ORA10GR2> alter session set events '10046 trace name context forever, level 12';

Session altered.

ops$tkyte%ORA10GR2> begin
  2          for i in 1 .. 5
  3          loop
  4                  for x in (select distinct owner, object_name, object_type from t)
  5                  loop
  6                          null;
  7                  end loop;
  8                  for x in (select owner, object_name, object_type from t group by owner, object_name, object_type)
  9                  loop
 10                          null;
 11                  end loop;
 12          end loop;
 13  end;
 14  /

PL/SQL procedure successfully completed.





SELECT DISTINCT OWNER, OBJECT_NAME, OBJECT_TYPE FROM T
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          1          0           0
Execute      5      0.00       0.00          0          0          0           0
Fetch     2475      1.54       1.59        165      13805          0      247425
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     2481      1.54       1.59        165      13806          0      247425

Rows     Row Source Operation
-------  ---------------------------------------------------
 247425  HASH UNIQUE (cr=13805 pr=165 pw=165 time=1381040 us)
1000720   TABLE ACCESS FULL T (cr=13805 pr=0 pw=0 time=1000940 us)



********************************************************************************
SELECT OWNER, OBJECT_NAME, OBJECT_TYPE FROM T GROUP BY OWNER, OBJECT_NAME, OBJECT_TYPE
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          1          0           0
Execute      5      0.00       0.00          0          0          0           0
Fetch     2475      1.51       1.86          0      13805          0      247425
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     2481      1.51       1.86          0      13806          0      247425

Rows     Row Source Operation
-------  ---------------------------------------------------
 247425  HASH GROUP BY (cr=13805 pr=0 pw=0 time=1573230 us)
1000720   TABLE ACCESS FULL T (cr=13805 pr=0 pw=0 time=1000937 us)

Chi H, December 07, 2006 - 8:48 pm UTC

I couldn't reproduce this, but found some production data that resembled the following:
select distinct id from t1@dblink
where id = 123;
-- returned one record

select id from t1@dblink
where id = 123;
-- returns 3 records

select *
from t2
where id in (select distinct id from t1@dlink where id = 123);
-- returned 3 records

Like I said, I couldn't create a test case to demonstrate this. Do you have any idea how this could be?

December 08, 2006 - 7:33 am UTC

ops$tkyte%ORA10GR2> create table t1 ( id int );

Table created.

ops$tkyte%ORA10GR2> create table t2 ( id int );

Table created.

ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> insert into t1 select 123 from all_users where rownum <= 3;

3 rows created.

ops$tkyte%ORA10GR2> insert into t2 select 123 from all_users where rownum <= 3;

3 rows created.

ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> select distinct id from t1 where id = 123;

        ID
----------
       123

ops$tkyte%ORA10GR2> select id from t1 where id = 123;

        ID
----------
       123
       123
       123

ops$tkyte%ORA10GR2> select * from t2 where id in (select distinct id from t1 where id = 123);

        ID
----------
       123
       123
       123

<b>the cardinality of a result returned from T2 isn't going to be affected by the cardinality of the subquery against t1 - that is LOSE THE DISTINCT, it is not relevant at all.  

T2 must have three records such that ID=123, that is all - no big mystery really...
</b>

LOSE THE DISTINCT

Matthew, December 08, 2006 - 8:48 am UTC

Or move it to the outermost SELECT if you just want distinct records

Chi H, December 08, 2006 - 2:28 pm UTC

I had tried this test case before my posting yesterday:
SQL> create table t1 (a number, b varchar2(10));

Table created.

SQL> insert into t1 values (1, 'A');

1 row created.

SQL> create table t2 (a number);

Table created.

SQL> insert into t2 values (1)
  2  ;

1 row created.

SQL> insert into t2 values (1);

1 row created.

SQL> commit;

Commit complete.

SQL> select distinct a from t2;

         A
----------
         1

SQL> select * from t1
  2  where a in (select a from t2);

         A B
---------- ----------
         1 A

SQL> select * from t1
  2  where a in (select distinct a from t2);

         A B
---------- ----------
         1 A

I'm feeling a bit too dense.  I don't understand the difference between our results.

December 09, 2006 - 12:41 pm UTC

umm, I selected from t2, not t1 and I had different numbers of rows.

so, not sure where you are going with this.

DISTINCT vs GROUP BY?

Alejandro Daza, December 09, 2006 - 10:13 am UTC

Hey David Aldridge, that test you did is not the same, you have to create the index that Tom´s create.

A conclusion of these tests, it´s that you have to evaluate your queries, ONE by ONE, and you have to make proper indexes to faster up operations like hash. Only with time (expertise (30%)) and the way you perform your architect (or your program data IO structure (70%)), you would prefer to make some operation or another.

My conclusions here are:
1. If you have sort unique, hey I think could be more faster or the same as group by unique
2. If you have hash unique consider a index to cenvert it to sort unique
3. If you can´t index it all, do the group by operation,
BUT REMEMBER TRY ALWAYS TO EVALUATE FIRST

Chi H, December 10, 2006 - 3:06 pm UTC

Tom,
I just took a closer look and see where we are on different pages. My fault for not being clear.

As I mentioned, I created the test case before posting the question, so I just copied and pasted. It just happened that we used the same table names.

My test case shows that t1 has only 1 record and t2 has 2 records. Doing a select from t1 where id in t2 should only give one record. In our production environment, this type of scenario returned multiple records (3 in the production system).

I was just wondering if you could explain it. Your test case is different from the scenario.

December 10, 2006 - 7:41 pm UTC

ok, tell you what - you post the 100% complete, concise, yet 100% here test case - and let us look at it.

far far far too confusing to piece all of the tiny bits together here.

Chi H, December 11, 2006 - 3:09 pm UTC

Tom,
Sorry to have confused you. As I mentioned, I couldn't create a test case, and was hoping you knew something from the back of your mind about this situation.

I think the next best thing is to ping support to have them take a look.

Thanks for the follow-up.

December 11, 2006 - 7:18 pm UTC

nope, need test case - not following your sequence of events in my head - need to see it STEP by STEP

Is this a bug?

A reader, January 10, 2007 - 4:46 pm UTC

SQL> select object_type from dba_objects where owner='SYSTEM' and status='INVALI
D' order by owner;

OBJECT_TYPE
-------------------
PROCEDURE
FUNCTION
PACKAGE BODY
PACKAGE BODY
PROCEDURE
PROCEDURE
PROCEDURE
FUNCTION
PROCEDURE

9 rows selected.

SQL> select distinct object_type from dba_objects where owner='SYSTEM' and statu
s='INVALID' order by owner;

OBJECT_TYPE
-------------------
PROCEDURE
PACKAGE BODY
FUNCTION

SQL> select distinct object_type from dba_objects where owner like 'SYSTEM' and statu
s='INVALID' order by owner

*
ERROR at line 1:
ORA-01791: not a SELECTed expression

SQL>

Oracle 10g r2.

Thanks

please think about all posibilities

deccar, March 26, 2007 - 4:32 am UTC

fools... what will be happen if you use an aggregation function with a group by clause when you really wanted to use distinct? well I'll tell you, your results will be erroneous, cause the function DOES use all the resulting tuples, not only the ones youre seeing.

March 26, 2007 - 7:41 am UTC

brilliant.

absolute hogwash as stated

but well done, thanks so much for your "input"

Please be professional !!

Tom Admirer, March 26, 2007 - 2:37 pm UTC

Tom,

The above review comments do not adhere to the professional standards. People participating here should apply some professional standards and avoid using some coarse or vulgar words...Tom...you should take this seriously and discourage them (as you do it for the people who do the IM talk)...Probably you should consider removing the previous review from deccar.

Well tom, it finally rests in your discretion.

To all other readers and reviewers, please be professional in here...it is not the site to vent your anger...Please express your ideas in a more elegant/proper way.

March 27, 2007 - 9:16 am UTC

moderating is a slippery slope. where does it end.

I don't moderate, but I have no problem making fun :)

So, if we either

o laugh at them
o ignore them

which ever you prefer - they lose, we win.

How would I do that?

daniel, May 03, 2007 - 5:02 pm UTC

create table test_table
(t_id number,
t_name varchar2(10)
);

insert into test_table VALUES (1, 'RAIN')
/
insert into test_table VALUES (2, 'RAIN')
/
insert into test_table VALUES (3, 'RAIN')
/
insert into test_table VALUES (1, 'FOG')
/
insert into test_table VALUES (2, 'FOG')
/
My goal is to select records (all fields) based on the t_name being distinct. I just need to get 1 record for 'RAIN' AND 1 for 'FOG'. Doesn't matter which one. Distinct will look on all the fields and will return 5 rows.

May 04, 2007 - 12:48 pm UTC

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:228182900346230020

shows three different techniques.

this seems to work

daniel, May 03, 2007 - 5:12 pm UTC

Would this be correct?

select distinct t_name, first_value(t_id) over (partition by t_name)
from test_table

May 04, 2007 - 12:48 pm UTC

no, that returns every row

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:228182900346230020

shows three techniques

So am I just getting lucky?

daniel, May 04, 2007 - 3:20 pm UTC

SQL> select * from test_table;

T_ID T_NAME
---------- ----------
1 RAIN
3 RAIN
1 FOG
2 FOG

SQL> select distinct t_name, first_value(t_id) over (partition by t_name)
2 from test_table
3 /

T_NAME FIRST_VALUE(T_ID)OVER(PARTITIONBYT_NAME)
---------- ----------------------------------------
FOG 1
RAIN 1 1

May 04, 2007 - 4:06 pm UTC

sorry, did not see the distinct

yes, that will work as well

Distinct on Analytics?

Tom Admirer, May 05, 2007 - 10:06 pm UTC

Tom,

I do not think that you would appreciate the idea of using "DISTINCT" on analytics function. Do you?

So the better idea would be to use any of the three ways you suggested:

select t_id,t_name from
(select a.*,row_number() over (partition by t_name order by t_id) rn from test_table a )
where rn=1


SCOTT @ MYDB> /

      T_ID T_NAME
---------- ----------
         1 FOG
         1 RAIN

Or is there any workaround to avoid the distinct?

May 08, 2007 - 10:22 am UTC

you just demonstrated one way.

I pointed you to three ways.

Probably no difference

Mark Brady, May 07, 2007 - 10:58 am UTC

The Analytic function and the Distinct will both cause a sort - I believe. Did you cost both out? don't just guess if distinct is worse, show that it is.

It's true that I usually tell my developers that if you have to add distinct to make the results correct and you don't know why, it's a missed join causing a cartesian. It's also true that if you use an AF to do normal aggregation you'll get back a lot of duplicate rows that you have to eliminate. (There's another post somewhere on this site where TK does show this point and that the Distinct is ok)

Although I think you tried to simplify your example too much and you made it non-sensical.

May 08, 2007 - 11:07 am UTC

No, the distinct will be in general much worse - the optimizer recognizes top-n quereis with row_number()

T is a copy of all objects:

select *
  from (select t.*, row_number() over (partition by owner order by created) rn
          from t )
 where rn = 1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          1          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        3      0.20       0.20          3        692          4          30
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        5      0.21       0.21          3        693          4          30

Rows     Row Source Operation
-------  ---------------------------------------------------
     30  VIEW  (cr=692 pr=3 pw=3 time=206306 us)
     68   WINDOW SORT PUSHED RANK (cr=692 pr=3 pw=3 time=206278 us)
  49998    TABLE ACCESS FULL T (cr=692 pr=0 pw=0 time=600043 us)

********************************************************************************
select distinct owner,
first_value( OWNER ) over (partition by owner ),
first_value( OBJECT_NAME ) over (partition by owner ),
first_value( SUBOBJECT_NAME ) over (partition by owner ),
first_value( OBJECT_ID ) over (partition by owner ),
first_value( DATA_OBJECT_ID ) over (partition by owner ),
first_value( OBJECT_TYPE ) over (partition by owner ),
first_value( CREATED ) over (partition by owner ),
first_value( LAST_DDL_TIME ) over (partition by owner ),
first_value( TIMESTAMP ) over (partition by owner ),
first_value( STATUS ) over (partition by owner ),
first_value( TEMPORARY ) over (partition by owner ),
first_value( GENERATED ) over (partition by owner ),
first_value( SECONDARY ) over (partition by owner )
from t

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          1          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        3      0.96       1.93       2713        692         16          30
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        5      0.96       1.94       2713        693         16          30

Rows     Row Source Operation
-------  ---------------------------------------------------
     30  HASH UNIQUE (cr=692 pr=2713 pw=2024 time=1935907 us)
  49998   WINDOW SORT (cr=692 pr=2713 pw=2024 time=2031276 us)
  49998    TABLE ACCESS FULL T (cr=692 pr=0 pw=0 time=350063 us)

orafan, May 09, 2007 - 5:29 pm UTC

I have a table with receipe and ingredient information.
Pleas help me out to write a query to bring all receipes which has 'ING1' and 'ING2' in it .So in this case the result is receipe1 and receipe2.

create table abc ( receipe varchar2(10) ,ingredient varchar2(10))
/

insert into abc values ('Receipe1', 'ING1')
/
insert into abc values ('Receipe1', 'ING2')
/
insert into abc values ('Receipe1', 'ING3')
/
insert into abc values ('Receipe2', 'ING1')
/
insert into abc values ('Receipe2', 'ING3')
/
insert into abc values ('Receipe3', 'ING1')
/
insert into abc values ('Receipe3', 'ING2')
/
insert into abc values ('Receipe3', 'ING10')
/

May 11, 2007 - 10:32 am UTC

smells like homework.

you should be able to come with with at least 5 ways to do this... go for it, on a scale of 1 to 10 sql-wise, this is a 2 in terms of complexity (think GROUP BY and HAVING).

orafan, May 09, 2007 - 10:17 pm UTC

Forgot to maintain that I am looking for a sql solution without using set operation.

May 11, 2007 - 10:33 am UTC

"sql solution without using a set operation"

well, that is going to be HARD :)

given that, well, SELECT is a set operation.

yaz, May 10, 2007 - 3:47 am UTC

select receipe from abc
where receipe in (select receipe from abc where ingredient = 'ING1')
and ingredient = 'ING2';

Analytics Way

A reader, May 11, 2007 - 9:05 pm UTC

select receipe from (
select receipe,count(case when ingredient = 'ING1' then ingredient end) CNT_ING1,
               count(case when ingredient = 'ING2' then ingredient end) CNT_ING2
from abc group by receipe) where CNT_ING1>=1 and CNT_ING2>=1 order by 1

How about this?

May 14, 2007 - 12:54 pm UTC

that is not analytics, that is aggregation

select recipe
  from table
 where recipe in ( 'ING1', 'ING2' )
having count(distinct ingredient) = 2;

would be a "better" way to use aggregation....

Small Correction

A reader, May 14, 2007 - 4:40 pm UTC

Tom,

Small correction in your query, Probably this is what you meant...The logic was good.

select receipe from abc where ingredient in ( 'ING1', 'ING2' ) group by receipe having count(distinct ingredient) = 2;

May 14, 2007 - 5:30 pm UTC

correct, where ingredient in ....

Trick Question.. sort of!

Richard Armstrong-Finnerty, May 16, 2007 - 7:53 am UTC

The question is "a query to bring all receipes which has 'ING1' and 'ING2' in it .So in this case the result is receipe1 and receipe2"... which is impossible, as receipe2 does not have ING2! The 2 receipes (sic) that do have ING1 & ING2 are receipe1 & receipe3.

i love you tom!

dfxgirl, March 26, 2008 - 12:23 pm UTC

yeah that works! Tom is so sexy... uh.. i´m wet...

A reader, April 16, 2008 - 11:38 pm UTC

Different results with Distinct vs. Group By

Dylan, October 31, 2008 - 6:33 pm UTC

While doing some performance turning on a procedure, I came across a case where not only does the performance vary between a statement using distinct vs. group by, but I'm also getting different results. This had me scratching my head a little bit...

EUSER@stdv > set autotrace traceonly
EUSER@stdv > select billing_day
  2  from a_table
  3  group by billing_day;

<b>138 rows selected.

Elapsed: 00:00:00.78</b>

Execution Plan
----------------------------------------------------------

-----------------------------------------------------------------------
| Id  | Operation          | Name             | Rows  | Bytes | Cost  |
-----------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                  |   136 |  1088 |   485 |
|   1 |  SORT GROUP BY     |                  |   136 |  1088 |   485 |
|   2 |   TABLE ACCESS FULL|     A_TABLE      |   617K|  4826K|   105 |
-----------------------------------------------------------------------

Note
-----
   - 'PLAN_TABLE' is old version
   - cpu costing is off (consider enabling it)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       2703  consistent gets
       2692  physical reads
        188  redo size
       1913  bytes sent via SQL*Net to client
        300  bytes received via SQL*Net from client
         11  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
        138  rows processed

EUSER@stdv > select distinct billing_day
  2  from a_table;

<b>137 rows selected.

Elapsed: 00:00:23.03</b>

Execution Plan
----------------------------------------------------------

------------------------------------------------------------------------
| Id  | Operation          | Name              | Rows  | Bytes | Cost  |
------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                   |   137 |  1096 | 11407 |
|   1 |  SORT UNIQUE       |                   |   137 |  1096 | 11407 |
|   2 |   TABLE ACCESS FULL|     A_TABLE       |    14M|   111M|  2437 |
------------------------------------------------------------------------

Note
-----
   - 'PLAN_TABLE' is old version
   - cpu costing is off (consider enabling it)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
      63811  consistent gets
      63633  physical reads
       2464  redo size
       1981  bytes sent via SQL*Net to client
        300  bytes received via SQL*Net from client
         11  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
        137  rows processed

A_TABLE is a relatively small staging table for a data warehouse which contains < 350,000 rows. Since it's a dev environment and I didn't know what condition the table was in, I reorged it and analyzed it just for kicks. That had no impact.

Any thought on what could be causing something like this? We're on 10.2.0.4, but optimizer is still on 9.2.

November 02, 2008 - 4:59 pm UTC

you would need to provide a full example

and of course, if you believe you are getting a wrong result - especially since you obviously have support - you would file it with them - right??

Ignore my last

Dylan, November 03, 2008 - 9:42 am UTC

Figured out what it was. User error after a long week.

An example where 'distinct' and 'group by' are not the same

Jack Douglas, May 02, 2011 - 5:11 am UTC

with w as (select round(level/2) as id from dual connect by level < 11)
select distinct id, count(*) over (partition by id) from w;

with w as (select round(level/2) as id from dual connect by level < 11)
select id, count(*) over (partition by id) from w group by id;

May 04, 2011 - 12:12 pm UTC

yes, true, because analytics are done after the where clause/aggregation takes place...

Question?

chithambaram.p, May 24, 2011 - 11:57 pm UTC

Hi Tom,

How can I get distinct value from the table without using DISTINCT and GROUP BY...Now am using query

select distinct col_name from table1
where col_name>= 0
group by col_name order by col_name

this query is taking long time bcoz the table has 6 crores records..

how can i get good performance plz help me?

May 25, 2011 - 11:26 am UTC

if you have an index on col_name, we can index fast full scan that instead of the table - but distinct is going to be what you use.

There are other ways, but they certainly won't be faster.

@ chithambaram.p

Sokrates, May 25, 2011 - 11:48 am UTC

How can I get distinct value from the table without using DISTINCT and GROUP BY...
query

try UNIQUE

May 25, 2011 - 2:17 pm UTC

;) good one, I should have thought of that - as "select unique" is the same as "select distinct"

Just like "select" is the same as "select all"

not distinct, or unique, or group by but ...

Dave, May 25, 2011 - 10:44 pm UTC

select stuff from mytab
union
select stuff from mytab where 1 = 0;

(sorry :-))

How about this one?

Nathan Marston, May 26, 2011 - 9:56 pm UTC

SELECT stuff
from mytab t1
where not exists
(select null
 from mytab t2
 where t2.rowid < t1.rowid 
 and t2.col1 = t1.col1
 and t2.col2 = t2.col2
 ... repeat for each column in "stuff"
);

The thinking is that wherever you get a "duplicated" row, all the columns will match - but you want one of the duplicate rows (not zero), so the rowid comparison ensures you just get the one with minimum rowid.

I'm not convinced it even works in the face of nullable columns - I'm thinking NULL values will cause it to behave differently from a DISTINCT/GROUP BY.

You could work your way around that using nvl (and a sentinel value for NULL), but ugh... I'd rather not think about that. That SQL's ugly enough as it is :).

May 27, 2011 - 10:31 am UTC

what about nulls indeed.

it won't work with nulls like distinct, unique or group by would.

A reader, May 27, 2011 - 2:51 am UTC

Hi all,
Thanks for your reply...I tried all this way...
but still its very slow...and the column also has index..
the reason is am using this query in PRO*C... and the query is condition based if it is pl/sql then we can create seperate view for this query...but here there is no way to do...

here is my Execution Plan...plz help me to get performance in this...

SQL PLAN REPORT
------------------------------------------------------------------------------
SELECT STATEMENT Cost = 72928
2.1 SORT GROUP BY
3.1 INDEX FAST FULL SCAN COL1_IDX3_NEW INDEX

---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)|
---------------------------------------------- ------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 108 | | 108K (4)|
|* 1 | FILTER | | | | | |
| 2 | SORT GROUP BY | | 2 | 108 | | 108K (4)|
|* 3 | HASH JOIN | | 18M| 932M| 425M| 105K (2)|
|* 4 | INDEX FAST FULL SCAN| COL1_IDX3_NEW | 13M| 275M| | 28472 (3)|
|* 5 | INDEX FAST FULL SCAN| COL1_IDX3_ NEW | 13M| 400M| | 28472 (3)|
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(ROWID=MIN(ROWID))
3 - access("COL1"="A"."COL1")
4 - filter("COL1">0)
5 - filter("COL1">0)

Thanks to All..

by
Chithu...

May 27, 2011 - 10:40 am UTC

I don't know who you are or what you are talking about "reader"

no query to look at, no real context (you cannot be the one from above since they were obviously using a different query, they were using:

select distinct col_name from table1
where col_name>= 0
group by col_name order by col_name

Sambhav, May 28, 2011 - 5:55 am UTC

Hi Tom,

How can we find the sum of each depatment salary without using group by clause.

Regards
Sambhav

May 31, 2011 - 9:45 am UTC

please ask questions in ONE and ONLY ONE place. Look in the other place you asked (and I answered) this same exact question.

answer to sambhav

Mani, May 30, 2011 - 6:06 am UTC

Sambhav, write a plsql procedure where you sum all the salary departmentwise and store it in a temporaray table and do a select from there.

its what is simply re-inventing the wheel :)..any reason as to why group by should not be used??

@Sambhav

A reader, May 30, 2011 - 8:16 am UTC

Sambhav,

may be you can use the analytical function:

<code><i>select distinct * from (select  deptno, sum(sal) over(partition by deptno order by deptno)sum_over_dept from emp);</i>

Data:

DeptNo Sal
10 300
20 200
10 200

Result of query:

DeptNo Sum_Over_Dept
10 500
20 200

</code>

SORT GROUP BY Vs HASH GROUP BY

Rajeshwaran, Jeyabal, June 09, 2011 - 12:12 pm UTC

Tom:

If i remember correct, In the previous release of Oracle, we have SORT ORDER BY access path, if a query has Order by clause.

but in Oracle 10g, I don't see that happening

SORT GROUP BY - if Group by & Order by in query
HASH GROUP BY - if only Group by and Order by is missing
SORT ORDER BY - if only no group by and only order by clause.

rajesh@ORA10GR2> set autotrace traceonly explain;
rajesh@ORA10GR2>
rajesh@ORA10GR2> select owner,object_name,count(*)
  2  from big_table
  3  group by owner,object_name
  4  order by owner,object_name;

48441 rows selected.

Elapsed: 00:00:59.88

Execution Plan
----------------------------------------------------------
Plan hash value: 3184991183

----------------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |   437K|    12M|       | 84566   (4)| 00:16:55 |
|   1 |  SORT GROUP BY     |           |   437K|    12M|   505M| 84566   (4)| 00:16:55 |
|   2 |   TABLE ACCESS FULL| BIG_TABLE |    12M|   377M|       | 41518   (2)| 00:08:19 |
----------------------------------------------------------------------------------------

rajesh@ORA10GR2>
rajesh@ORA10GR2> select owner,object_name,count(*)
  2  from big_table
  3  group by owner,object_name;

48441 rows selected.

Elapsed: 00:00:46.22

Execution Plan
----------------------------------------------------------
Plan hash value: 1753714399

----------------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |   437K|    12M|       | 84566   (4)| 00:16:55 |
|   1 |  HASH GROUP BY     |           |   437K|    12M|   505M| 84566   (4)| 00:16:55 |
|   2 |   TABLE ACCESS FULL| BIG_TABLE |    12M|   377M|       | 41518   (2)| 00:08:19 |
----------------------------------------------------------------------------------------

rajesh@ORA10GR2> select owner,object_name,cnt
  2  from (
  3  select owner,object_name,count(*) as cnt
  4  from big_table
  5  group by owner,object_name
  6  ) order by owner,object_name;

48441 rows selected.

Elapsed: 00:01:06.73

Execution Plan
----------------------------------------------------------
Plan hash value: 3184991183

----------------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |   437K|    12M|       | 84566   (4)| 00:16:55 |
|   1 |  SORT GROUP BY     |           |   437K|    12M|   505M| 84566   (4)| 00:16:55 |
|   2 |   TABLE ACCESS FULL| BIG_TABLE |    12M|   377M|       | 41518   (2)| 00:08:19 |
----------------------------------------------------------------------------------------

rajesh@ORA10GR2>

rajesh@ORA10GR2> select owner,object_name
  2  from big_table
  3  order by owner,object_name
  4  /
Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
Plan hash value: 1472477105

----------------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |    12M|   377M|       |   152K  (3)| 00:30:32 |
|   1 |  SORT ORDER BY     |           |    12M|   377M|   977M|   152K  (3)| 00:30:32 |
|   2 |   TABLE ACCESS FULL| BIG_TABLE |    12M|   377M|       | 41518   (2)| 00:08:19 |
----------------------------------------------------------------------------------------

rajesh@ORA10GR2>

Questions

1)What does this SORT GROUP BY & HASH GROUP BY is really about? I am checking in new features guide. but i dont find there

http://download.oracle.com/docs/cd/B19306_01/server.102/b14214/toc.htm

June 09, 2011 - 12:20 pm UTC

I'm very confused by this question.

You say

If i remember correct, In the previous release of Oracle, we have SORT ORDER BY access path, if a query has Order by clause.

but in Oracle 10g, I don't see that happening

SORT GROUP BY - if Group by & Order by in query
HASH GROUP BY - if only Group by and Order by is missing
SORT ORDER BY - if only no group by and only order by clause.

and then proceeded to post and example that contradicts 100% what you just said. We see that in queries with order bys, it tends to use a sort order/group by.

Which is what you just said you didn't see happening?????

and they are not access paths - table access by index rowid, that's an access path. Sort group by/Hash group by is a step in a plan, it isn't used as an access path to data.

Internal Sorts

Snehasish Das, December 14, 2012 - 1:41 am UTC

Hi Tom,
I read your articles about sort in distinct and group by. Also seen the examples where Hash Group by is used in Grouping operations.

Can you please let me know which operations in oracle cause Internal sorting, i.e may use SORT_AREA.
One is order by, other can be UNION/Minus.

Regards,
Snehasish Das

December 17, 2012 - 4:07 pm UTC

Can you please let me know which operations in oracle cause Internal sorting,
i.e may use SORT_AREA.

too numerous to list, if you see "order" in the plan, then you found one. most all of the set operations (union, intersect, minus, distinct...) and many functions (like analytics), join operations (like sort merge joins), etc.

and we don't really use sort_area anymore, we use PGA automatic memory management and that ignores sort_area_size and such.

Ranjan, May 25, 2013 - 3:21 pm UTC

Hi Tom,

I found some strange thing , please have a look on this.

SQL> drop table tt;

Table dropped.

SQL> create table tt(a number,b number);

Table created.

SQL> insert into tt select level,level+1 from dual connect by level <6 union all select 1,2 from dua
l;

6 rows created.

SQL> commit;

Commit complete.

SQL> select * from tt;

         A          B
---------- ----------
         1          2
         2          3
         3          4
         4          5
         5          6
         1          2

6 rows selected.

SQL> select distinct a from tt order by b;
select distinct a from tt order by b
                                   *
ERROR at line 1:
ORA-01791: not a SELECTed expression -- Here as we know, order by would work at the end then why this query is throwing error.(but gone if order by column is in select list).


SQL> select distinct a from tt order by a;

         A
----------
         1
         2
         3
         4
         5

SQL> select distinct a from tt where b=2 order by b;

         A
----------
         1

SQL> select distinct a from tt where b in(2) order by b;

         A
----------
         1

SQL> select distinct a from tt where b>2 order by b;
select distinct a from tt where b>2 order by b
                                             *
ERROR at line 1:
ORA-01791: not a SELECTed expression


SQL> select distinct a from tt where b<3 order by b;
select distinct a from tt where b<3 order by b
                                             *
ERROR at line 1:
ORA-01791: not a SELECTed expression

Here why it is throwing error as data concerned b<3
 is equivalent as b=2(coz only less that b data is 2 here).

SQL> select * from tt;

         A          B
---------- ----------
         1          2
         2          3
         3          4
         4          5
         5          6
         1          2

6 rows selected.

SQL> select distinct a from tt where b=b order by b;
select distinct a from tt where b=b order by b
                                             *
ERROR at line 1:
ORA-01791: not a SELECTed expression
--even got error here .

Could you please give your thoughts for this strange behaviour.

thanks as always.

May 29, 2013 - 5:04 pm UTC

where b=2 order by b;

is the same as:

where b=2;

the order by is known to be not necessary and disappears.

same with in.

anytime the optimizer KNOWS there is only one value for B, it knows the order by is not necessary (you are ordering by a constant at that point! just b=2)

where b=b returns ALL rows in this case, the optimizer knows this and throws the error "I cannot sort by B, B is not selected"

got few and have doubt on few

Ranjan, May 30, 2013 - 6:41 am UTC

Hi Tom,
Thanks for your reply.

Yes that is true b=2 or b in (2) order by b will be disappear(coz it has nothing coz b's values will be 2 anyway).

I didnt understand b=b will return all rows,but I think there is a distinct clause so why it will return all rows and why the optimizer will not able to do order by b.

And could you please tell why it is throughing error in case of b<3.

Thanks as always,
Ranjan

May 30, 2013 - 2:50 pm UTC

you don't understand why "b=b" would return all rows in your case?

You have no nulls.

In ever row, B in that row is equal to B in that row.

So, every row will be returned.

where b=b would only NOT return a row if B were NULL.

when you have b<3, you have the ability to return b=2, b=1, b=0, b=.......... you could be returning all rows in the table with many different values of B. Your data might not allow it right this second, but a simple insert could change that.

IF and ONLY IF the optimizer is able to mathematically prove that B is constant in the result set can it optimize away "b".

cont to my last post

Ranjan, May 30, 2013 - 4:48 pm UTC

thank you for your nice explanation.

I was thinking how distinct + b=b will return all row because there is one duplicate record in my data :) 1 2 and 1 2 .

your last line actually makes everything clear now.

thanks again.

Breadcrumb

Question and Answer

You Asked

and Tom said...

Rating

Comments

Thanks Tom

DISTINCT issues

Direct Answer

Distinct vs Group By

DISTINCT de-mythified by Prometheus

LOSE THE DISTINCT

DISTINCT vs GROUP BY?

Is this a bug?

please think about all posibilities

Please be professional !!

How would I do that?

this seems to work

So am I just getting lucky?

Distinct on Analytics?

Probably no difference

Analytics Way

Small Correction

Trick Question.. sort of!

i love you tom!

Different results with Distinct vs. Group By

Ignore my last

An example where 'distinct' and 'group by' are not the same

Question?

@ chithambaram.p

not distinct, or unique, or group by but ...

How about this one?

answer to sambhav

@Sambhav

SORT GROUP BY Vs HASH GROUP BY

Internal Sorts

got few and have doubt on few

cont to my last post

More to Explore

DBMS_STATS

DISTINCT vs, GROUP BY

Breadcrumb

Question and Answer

You Asked

and Tom said...

Rating

Comments

Thanks Tom

DISTINCT issues

Direct Answer

Distinct vs Group By

DISTINCT de-mythified by Prometheus

LOSE THE DISTINCT

DISTINCT vs GROUP BY?

Is this a bug?

please think about all posibilities

Please be professional !!

How would I do that?

this seems to work

So am I just getting lucky?

Distinct on Analytics?

Probably no difference

Analytics Way

Small Correction

Trick Question.. sort of!

i love you tom!

Different results with Distinct vs. Group By

Ignore my last

An example where 'distinct' and 'group by' are not the same

Question?

@ chithambaram.p

not distinct, or unique, or group by but ...

How about this one?

answer to sambhav

@Sambhav

SORT GROUP BY Vs HASH GROUP BY

Internal Sorts

got few and have doubt on few

cont to my last post

More to Explore

DBMS_STATS