Skip to Main Content

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, Dinakar.

Asked: August 06, 2009 - 6:25 am UTC

Last updated: January 31, 2012 - 5:01 pm UTC

Version: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit

Viewed 10K+ times! This question is

You Asked

I have 2 questions for you TOM:

1. ) when a query has a distinct clause on a clomun and the same column is used in the order by clause, will oracle order the data first and eliminate the duplicate or will it eliminate the duplicate records and then do the order by?
And how is it different if the distinct is on multiple columns?

2.) AND which query is better out of the below:

select a as c from table1
union
select b as c from table2

or

select distinct c from
(
select a as c from table1
union all
select b as c from table2
)

or

select c from
(
select a as c from table1
union
select b as c from table2
) group by c

and Tom said...

1) maybe yes, maybe no, maybe yes for you but no for me.

IT DEPENDS. It is not set in stone. it is entirely up to the optimizer.


2) if you wanted distinct output, A union B is all you needed.

That third query is so wrong. The UNION already "uniqued" the data, the group by would just do it again.

Rating

  (11 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

internal logic of distinct clause

Dinakar Ullas, August 06, 2009 - 10:13 am UTC

Thanks a lot Tom, the third query was supposed to be
select c from
(
select a as c from table1
union all
select b as c from table2
) group by c
Tom Kyte
August 06, 2009 - 10:59 am UTC

it is still not the right way. You meant to say A UNION B... that is what you want.

internal logic of distinct clause

Dinakar Ullas, August 07, 2009 - 2:27 am UTC

Tom, i still wonder what might be the internal algorithm behind a distinct clause. i feel a distinct clause along with an order by should work faster than a distinct clause without order by. Lets see this example:
a.) lets assume after an order by data looks like this
1
1
2
3
3
3
4
and now if oracle applies distinct on the already ordered data, it should take the first record 1 and go to next record which is again 1 and stop (bec next record is 2 which does not match) and then delete the second occurance of 1.next take 2 and find the next record which is 3..... This ways its easy to delete duplicates.

b.) if suppose the data is not ordered as below

1
2
3
4
1
2
2
here, it has to take the first record 1 and match it with all six records below and then identify the duplicate. The same case with 2, it has to start matching from the first record 1 and move down. This is cumbersome.......
Tom Kyte
August 07, 2009 - 9:42 am UTC

... i still wonder what might be the internal algorithm behind a distinct
clause. i feel a distinct clause along with an order by should work faster than
a distinct clause without order by. ...

why? You don't have to SORT to DISTINCT.


... This ways its easy to
delete duplicates.
....

not all of the time. Sometimes YES, sometimes NO.


We can distinct using sorting
We can distinct using hashing


ops$tkyte%ORA10GR2> set autotrace on
ops$tkyte%ORA10GR2> select distinct job from scott.emp;

JOB
---------
CLERK
SALESMAN
PRESIDENT
MANAGER
ANALYST


Execution Plan
----------------------------------------------------------
Plan hash value: 3709190377

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     5 |    40 |     3  (34)| 00:00:01 |
|   1 |  HASH UNIQUE       |      |     5 |    40 |     3  (34)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| EMP  |    14 |   112 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------



why bother sorting the data at all - that would be an unnecessary step. However, if you ask for the data to be ordered:

ops$tkyte%ORA10GR2> select distinct job from scott.emp order by 1;

JOB
---------
ANALYST
CLERK
MANAGER
PRESIDENT
SALESMAN


Execution Plan
----------------------------------------------------------
Plan hash value: 596748738

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     5 |    40 |     4  (50)| 00:00:01 |
|   1 |  SORT UNIQUE       |      |     5 |    40 |     3  (34)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| EMP  |    14 |   112 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------



then we will do the two operations in a single pass - we'll sort and distinct at the same time - but since sorting is very expensive - if we don't have to sort - we won't.

Does DISTINCT cause Oracle to do an implicit SORT?

Ben, August 28, 2009 - 1:28 pm UTC

Tom,

I was told on an Oracle Performance training course (by the instructor) that the SQL clauses DISTINCT and GROUP BY both caused an implicit sort of the dataset. Is that really the case?

Thanks.
Tom Kyte
August 28, 2009 - 5:24 pm UTC

no, it has never had to be true that either did group by.

NEVER

if you need data sorted - you need to use ORDER BY, if the optimizer can skip doing a sort of the data by including it in some other step - fine, but YOU cannot skip putting the order by there.


http://asktom.oracle.com/Misc/order-in-court.html



what the trainer should have said is "Oracle is smart enough to skip ordering the data if the prior rowsource is already returning the data sorted, but you must use order by in all cases if you want it sorted. The optimizer will remove it if it can - but you cannot"

Vocabulary issue

Nicosa, August 31, 2009 - 6:42 am UTC

Just my 2 cents here :

I think most of those misunderstandings comes from vocabulary : To sort doesn't imply to give an ordering to the output.

Most of the time when sort is done on numbers, dates or strings, we expect the output to be ordered because those datatypes have a natural order.

But what if you had to sort by color (blue, orange, etc...) or by shape (square, rectangular, triangular, circle) ?
(Imagine Oracle had a native COLOR or SHAPE datatype)
Those do not really have a "natural ordering".

What if I give you mixed colors jelly beans and ask you to sort them by colour ?
Why would you order when I only asked to get every blue bean with other blue beans, and every yellow with yellows, etc...

I think the important is to understand that sorting is NOT ordering (hence the two different words).
Tom Kyte
August 31, 2009 - 7:45 am UTC

... Most of the time when sort is done on numbers, dates or strings, we expect the
output to be ordered because those datatypes have a natural order.
...

actually - no. Consider that here in the US - we use all of 7bits to store our string data. In France, you use 8bits. Now, in the US a string sorted in binary fashion (what a group by does if it orders) would 'sort' strings correctly - since all of our letters are ascii 127 and under. The strings in France however, if sorted in binary fashion, would be *wrong*, not sorted - because you need a character set sort - so that some characters with codes between 128 and 255 sort below other characters in the 0..127 range.


so, the sort that might be done by a group by or distinct would not ORDER the data (we do a binary sort for those operations, order by uses a character set ordering to sort data)


but yes, I see your point in the differentiation - that is a good way of putting it - but use the string example instead of colors - it is more "natural" (everyone has a frame of reference for it) and we don't really order by color in the database - but we do order strings.

Sure, but sorting is not only computer/database related

Nicosa, August 31, 2009 - 9:05 am UTC

actually - no. Consider that here in the US - we use all of 7bits to store our string data. In France, you use 8bits...

Being a true fan of askTom, I know all that.
I was speaking at larger point of view (not only database/computer sorting).
Same for the natural order, it was meant "natural order relative to each".

Operation's contract and implementation detail shall be distinguished

Oleksandr Alesinskyy, August 31, 2009 - 10:10 am UTC

How distinct is implemented is just implementation detail - it may change with the next version (or next patchset, or next patch, ...). It is completely, 100% internal thing that never should be revealed or relied on.
Tom Kyte
August 31, 2009 - 1:13 pm UTC

and it did change dramatically :)

it rarely returns data "sorted" by anything.

And it never had to, it never "always did a sort", there were many ways to distinct data from the very beginning.

Oleksandr Alesinskyy, August 31, 2009 - 5:18 pm UTC

If I remember correctly in very old versions of Oracle (in which subquery in the FROM clause was permitted, but ORDER BY clause in such subqueries was prohibited - I guess they were versions from 7.1.6 to 7.3.2 (not completely sure, but something like this) DISTINCT was practically officially recommended way to obtain sorted result from the subquery (e.g. for Top-N task). At least I can clearly remember an article in Oracle magazine with such recommendation (but as far I can remember the caveat "it may change in the future" was present!).
Tom Kyte
August 31, 2009 - 6:24 pm UTC

...DISTINCT was practically officially recommended ....

never in any documentation I've ever read

distinct

a) never had to sort
b) sorted binary
c) if the subquery contained more than one column - what would the set be ordered by exactly??

... clearly remember an article in Oracle magazine. ...

Ahh, reminds me of how http://asktom.oracle.com/Misc/birth-of-asktom.html asktom got started in the first place, this is all the result of a "column in that magazine"</a>

... but as far I can remember the caveat "it may change in the future" was present! ...

it had already changed! before 7.1. distinct *never had to sort*, it might have, but it did not have to and a plan change could change the output at any time.



Its natural for humans to think a sort is needed

Galen Boyer, September 01, 2009 - 10:36 am UTC

I think people believe the group by or distinct clauses would cause an internal sort, because this is how we would solve things, or on the surface it looks that way. Give me a deck of cards and tell me if I have 13 of each suit. I'd probably order them by suit and then count each suit (or maybe, while I'm ordering into suits, count the cards as well so if not 52, the answer is immediately no). So, of course, a machine must think that way. But, now, give me 100 decks of cards all together and ask me if I have 1300 cards per suit. I'd probably say, give me a pencil and a sheet of paper as well as 1 other guy. The first guy takes the pencil and writes 4 lines, each starting with the suit. Then I just flip over each card calling out the suit. The other guy puts a mark in whatever suit slot (well, maybe keeps incrementing the number cause I'd rather have him use excel and sum a column or something). So, in the end, I never had to sort anything to get the answer and I truly would not have wanted to because the amount of work for sorting would have been large and it would not have gotten me much closer to my answer anyways. I would guess the latter is probably how Oracle handles group by.
Tom Kyte
September 01, 2009 - 11:47 am UTC

... I'd
probably order them by suit and then count each suit ...

Actually, good example. How do you "order by suit", which comes first?

We hash them - if the first card is a heart, we start a heart pile and then a clubs and then a spade and then a diamond. Maybe next time is it clubs, spades, diamonds hearts. And then we hash the rest of the deck.

We do that all of the time to separate things, to group them. We rarely "sort" them.


today - given fast cpu and large memory, we tend to use hashing. In the past, given slow cpu and small memory - we used sorting and paging.

...DISTINCT was practically officially recommended ....

Oleksandr Alesinskyy, September 01, 2009 - 7:06 pm UTC

>>...DISTINCT was practically officially recommended ....

>never in any documentation I've ever read

In documentation I as well have not seen, as far as I can remember. But I would gladly take a look on "Application Developer Guide" (or how it was named at that time) from 7.1.6 or 7.2 (if they available somewhere?). It is possible (just possible, with low probability) that the trick was present there in one of the samples.

Not that I say that it was true at any point in time :) I just said that it was often recommended by "experts" - including Oracle own magazine (and it was an article, not just user supplied tip) - at that time as a way to sort results of the subselect. BTW, hash joins has not existed at that time, they were added later (if I remember properly). It does not necessarily means that hashing was not used for DISTINCT, but ...
>a) never had to sort
Sure it never was part of its specification.

>b) sorted binary
Not important for numeric columns (Top-N).

>c) if the subquery contained more than one column
Sure, it is completely undefined.


>>... clearly remember an article in Oracle magazine. ...

>Ahh, reminds me of how

Yes, I remember this dirty trick with direct update of COL$.
Have done it once (just from curiosity and on the test installation).


Tom Kyte
September 02, 2009 - 10:28 am UTC

but I don't even see how it would be

select distinct a,b,c from t

how would that have returned the data - sorted by what?


...
>b) sorted binary
Not important for numeric columns (Top-N).
.....

how could you do a top-n since you didn't know "how" it would be sorted - multiple columns, how would the sorted data be returned???

... just said that it was
often recommended by "experts" - including Oracle own magazine (and it was an
article, not just user supplied tip) - at that time as a way to sort results of
the subselect. BTW, hash joins has n...

and they were mistaken and I would have shouted as loud as possible at them to "stop it". I cannot find it (google destroyed the newsgroup archives when they bought dejanews - searching back in time is useless) but I remember quite a heated discussion with a guy named 'keystroke' in the mid 1990's. the gist was:


poster asks: how to create a view with an order by
tom said: you cannot (it was not a capability at that time)
keystroke says: when people say you cannot it means they do not know how, you use group by and group by will sort the data

tom said: bzzzztttt - totally wrong, here are a couple of examples in version 7.0 showing group by does not sort. Simply adding an index for example would change the sort order. simply changing your nls_settings would change the sort order (give the wrong sort order and so on)


because it was in an Oracle Magazine article (which are mostly written by other people, especially back then, not Oracle itself - and reviewed by copy editors - not tech editors necessarily (that is why I got involved in the first place))



... BTW, hash joins has not existed at that time, they were added
later (if I remember properly). ...

yes, that was a version 7.3 addition - the hash join was.

but even before hashing - there were indexes, unique indexes in particular and they would return data sorted in the order of the indexed columns which might not be the same as the projected columns



DISTINCT in Oracle 8.1.6 documentation

Oleksandr Alesinskyy, September 01, 2009 - 7:52 pm UTC

See "Oracle8i Designing and Tuning for Performance
Release 2 (8.1.6)" available at http://ugrafmsh.ru/japan/Our_Resources/Doc_Oracle8i/server.816/a76992/sql.htm

There is the sentence "Minimize the use of DISTINCT. DISTINCT always creates a sort; all the data must be instantiated before your results can be returned.".

What is even more funny - it is located in the "Use Care When Embedding Data Value Lists in Applications" section,
so it is almost definitely a leftover from sloppy editing of the older version of this book.

Ant it is so - see "Oracle8 Tuning, Release 8.0" at download.oracle.com, namely, http://download.oracle.com/docs/cd/A64702_01/doc/server.805/a58246.pdf .

And here it as well is leftover (for the same reasons) - but I'm unable to trace it any further.

So such a statement was in the documentation - BTW, for 8.1.6 it was false above any doubt.


QUESTION

marian, January 27, 2012 - 1:38 pm UTC

Hello,

Could you explain me why the first query is ok but the second one is wrong.

SQL> drop table t;

Table dropped.

SQL>
SQL> create table t as select rownum x,lpad(rownum,100,'0') y from dual connect by level <= 10000;

Table created.

SQL>

SQL> select distinct x || '-' || y from (select * from t) where x = 1 order by x;
1-0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001

SQL>

SQL> select distinct x || '-' || y from (select * from t where x = 1) order by x;
ERROR at line 1:
ORA-01791: not a SELECTed expression

Tom Kyte
January 31, 2012 - 5:01 pm UTC

ops$tkyte%ORA11GR2> create table t as select rownum x, lpad(rownum,100,'0') y from dual;

Table created.

ops$tkyte%ORA11GR2> select distinct x || '-' || y from (select * from t) where x = 1 order by x;

X||'-'||Y
----------------------------------------------------------------------------------------------------
1-00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
01


ops$tkyte%ORA11GR2>  select distinct x || '-' || y from (select * from t where x = 1) order by x;

X||'-'||Y
----------------------------------------------------------------------------------------------------
1-00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
01



care to mention a version? I cannot reproduce

More to Explore

Performance

Get all the information about database performance in the Database Performance guide.