internal logic of distinct clause
Dinakar Ullas, August 06, 2009 - 10:13 am UTC
Thanks a lot Tom, the third query was supposed to be
select c from
(
select a as c from table1
union all
select b as c from table2
) group by c
August 06, 2009 - 10:59 am UTC
it is still not the right way. You meant to say A UNION B... that is what you want.
internal logic of distinct clause
Dinakar Ullas, August 07, 2009 - 2:27 am UTC
Tom, i still wonder what might be the internal algorithm behind a distinct clause. i feel a distinct clause along with an order by should work faster than a distinct clause without order by. Lets see this example:
a.) lets assume after an order by data looks like this
1
1
2
3
3
3
4
and now if oracle applies distinct on the already ordered data, it should take the first record 1 and go to next record which is again 1 and stop (bec next record is 2 which does not match) and then delete the second occurance of 1.next take 2 and find the next record which is 3..... This ways its easy to delete duplicates.
b.) if suppose the data is not ordered as below
1
2
3
4
1
2
2
here, it has to take the first record 1 and match it with all six records below and then identify the duplicate. The same case with 2, it has to start matching from the first record 1 and move down. This is cumbersome.......
August 07, 2009 - 9:42 am UTC
... i still wonder what might be the internal algorithm behind a distinct
clause. i feel a distinct clause along with an order by should work faster than
a distinct clause without order by. ...
why? You don't have to SORT to DISTINCT.
... This ways its easy to
delete duplicates.
....
not all of the time. Sometimes YES, sometimes NO.
We can distinct using sorting
We can distinct using hashing
ops$tkyte%ORA10GR2> set autotrace on
ops$tkyte%ORA10GR2> select distinct job from scott.emp;
JOB
---------
CLERK
SALESMAN
PRESIDENT
MANAGER
ANALYST
Execution Plan
----------------------------------------------------------
Plan hash value: 3709190377
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 5 | 40 | 3 (34)| 00:00:01 |
| 1 | HASH UNIQUE | | 5 | 40 | 3 (34)| 00:00:01 |
| 2 | TABLE ACCESS FULL| EMP | 14 | 112 | 2 (0)| 00:00:01 |
---------------------------------------------------------------------------
why bother sorting the data at all - that would be an unnecessary step. However, if you ask for the data to be ordered:
ops$tkyte%ORA10GR2> select distinct job from scott.emp order by 1;
JOB
---------
ANALYST
CLERK
MANAGER
PRESIDENT
SALESMAN
Execution Plan
----------------------------------------------------------
Plan hash value: 596748738
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 5 | 40 | 4 (50)| 00:00:01 |
| 1 | SORT UNIQUE | | 5 | 40 | 3 (34)| 00:00:01 |
| 2 | TABLE ACCESS FULL| EMP | 14 | 112 | 2 (0)| 00:00:01 |
---------------------------------------------------------------------------
then we will do the two operations in a single pass - we'll sort and distinct at the same time - but since sorting is very expensive - if we don't have to sort - we won't.
Does DISTINCT cause Oracle to do an implicit SORT?
Ben, August 28, 2009 - 1:28 pm UTC
Tom,
I was told on an Oracle Performance training course (by the instructor) that the SQL clauses DISTINCT and GROUP BY both caused an implicit sort of the dataset. Is that really the case?
Thanks.
August 28, 2009 - 5:24 pm UTC
no, it has never had to be true that either did group by.
NEVER
if you need data sorted - you need to use ORDER BY, if the optimizer can skip doing a sort of the data by including it in some other step - fine, but YOU cannot skip putting the order by there.
http://asktom.oracle.com/Misc/order-in-court.html what the trainer should have said is "Oracle is smart enough to skip ordering the data if the prior rowsource is already returning the data sorted, but you must use order by in all cases if you want it sorted. The optimizer will remove it if it can - but you cannot"
Vocabulary issue
Nicosa, August 31, 2009 - 6:42 am UTC
Just my 2 cents here :
I think most of those misunderstandings comes from vocabulary : To sort doesn't imply to give an ordering to the output.
Most of the time when sort is done on numbers, dates or strings, we expect the output to be ordered because those datatypes have a natural order.
But what if you had to sort by color (blue, orange, etc...) or by shape (square, rectangular, triangular, circle) ?
(Imagine Oracle had a native COLOR or SHAPE datatype)
Those do not really have a "natural ordering".
What if I give you mixed colors jelly beans and ask you to sort them by colour ?
Why would you order when I only asked to get every blue bean with other blue beans, and every yellow with yellows, etc...
I think the important is to understand that sorting is NOT ordering (hence the two different words).
August 31, 2009 - 7:45 am UTC
... Most of the time when sort is done on numbers, dates or strings, we expect the
output to be ordered because those datatypes have a natural order.
...
actually - no. Consider that here in the US - we use all of 7bits to store our string data. In France, you use 8bits. Now, in the US a string sorted in binary fashion (what a group by does if it orders) would 'sort' strings correctly - since all of our letters are ascii 127 and under. The strings in France however, if sorted in binary fashion, would be *wrong*, not sorted - because you need a character set sort - so that some characters with codes between 128 and 255 sort below other characters in the 0..127 range.
so, the sort that might be done by a group by or distinct would not ORDER the data (we do a binary sort for those operations, order by uses a character set ordering to sort data)
but yes, I see your point in the differentiation - that is a good way of putting it - but use the string example instead of colors - it is more "natural" (everyone has a frame of reference for it) and we don't really order by color in the database - but we do order strings.
Sure, but sorting is not only computer/database related
Nicosa, August 31, 2009 - 9:05 am UTC
actually - no. Consider that here in the US - we use all of 7bits to store our string data. In France, you use 8bits...
Being a true fan of askTom, I know all that.
I was speaking at larger point of view (not only database/computer sorting).
Same for the natural order, it was meant "natural order relative to each".
Operation's contract and implementation detail shall be distinguished
Oleksandr Alesinskyy, August 31, 2009 - 10:10 am UTC
How distinct is implemented is just implementation detail - it may change with the next version (or next patchset, or next patch, ...). It is completely, 100% internal thing that never should be revealed or relied on.
August 31, 2009 - 1:13 pm UTC
and it did change dramatically :)
it rarely returns data "sorted" by anything.
And it never had to, it never "always did a sort", there were many ways to distinct data from the very beginning.
Oleksandr Alesinskyy, August 31, 2009 - 5:18 pm UTC
If I remember correctly in very old versions of Oracle (in which subquery in the FROM clause was permitted, but ORDER BY clause in such subqueries was prohibited - I guess they were versions from 7.1.6 to 7.3.2 (not completely sure, but something like this) DISTINCT was practically officially recommended way to obtain sorted result from the subquery (e.g. for Top-N task). At least I can clearly remember an article in Oracle magazine with such recommendation (but as far I can remember the caveat "it may change in the future" was present!).
August 31, 2009 - 6:24 pm UTC
...DISTINCT was practically officially recommended ....
never in any documentation I've ever read
distinct
a) never had to sort
b) sorted binary
c) if the subquery contained more than one column - what would the set be ordered by exactly??
... clearly remember an article in Oracle magazine. ...
Ahh, reminds me of how
http://asktom.oracle.com/Misc/birth-of-asktom.html asktom got started in the first place, this is all the result of a "column in that magazine"</a>
... but as far I can remember the caveat "it may change in the future" was present! ...
it had already changed! before 7.1. distinct *never had to sort*, it might have, but it did not have to and a plan change could change the output at any time.
Its natural for humans to think a sort is needed
Galen Boyer, September 01, 2009 - 10:36 am UTC
I think people believe the group by or distinct clauses would cause an internal sort, because this is how we would solve things, or on the surface it looks that way. Give me a deck of cards and tell me if I have 13 of each suit. I'd probably order them by suit and then count each suit (or maybe, while I'm ordering into suits, count the cards as well so if not 52, the answer is immediately no). So, of course, a machine must think that way. But, now, give me 100 decks of cards all together and ask me if I have 1300 cards per suit. I'd probably say, give me a pencil and a sheet of paper as well as 1 other guy. The first guy takes the pencil and writes 4 lines, each starting with the suit. Then I just flip over each card calling out the suit. The other guy puts a mark in whatever suit slot (well, maybe keeps incrementing the number cause I'd rather have him use excel and sum a column or something). So, in the end, I never had to sort anything to get the answer and I truly would not have wanted to because the amount of work for sorting would have been large and it would not have gotten me much closer to my answer anyways. I would guess the latter is probably how Oracle handles group by.
September 01, 2009 - 11:47 am UTC
... I'd
probably order them by suit and then count each suit ...
Actually, good example. How do you "order by suit", which comes first?
We hash them - if the first card is a heart, we start a heart pile and then a clubs and then a spade and then a diamond. Maybe next time is it clubs, spades, diamonds hearts. And then we hash the rest of the deck.
We do that all of the time to separate things, to group them. We rarely "sort" them.
today - given fast cpu and large memory, we tend to use hashing. In the past, given slow cpu and small memory - we used sorting and paging.
...DISTINCT was practically officially recommended ....
Oleksandr Alesinskyy, September 01, 2009 - 7:06 pm UTC
>>...DISTINCT was practically officially recommended ....
>never in any documentation I've ever read
In documentation I as well have not seen, as far as I can remember. But I would gladly take a look on "Application Developer Guide" (or how it was named at that time) from 7.1.6 or 7.2 (if they available somewhere?). It is possible (just possible, with low probability) that the trick was present there in one of the samples.
Not that I say that it was true at any point in time :) I just said that it was often recommended by "experts" - including Oracle own magazine (and it was an article, not just user supplied tip) - at that time as a way to sort results of the subselect. BTW, hash joins has not existed at that time, they were added later (if I remember properly). It does not necessarily means that hashing was not used for DISTINCT, but ...
>a) never had to sort
Sure it never was part of its specification.
>b) sorted binary
Not important for numeric columns (Top-N).
>c) if the subquery contained more than one column
Sure, it is completely undefined.
>>... clearly remember an article in Oracle magazine. ...
>Ahh, reminds me of how
Yes, I remember this dirty trick with direct update of COL$.
Have done it once (just from curiosity and on the test installation).
September 02, 2009 - 10:28 am UTC
but I don't even see how it would be
select distinct a,b,c from t
how would that have returned the data - sorted by what?
...
>b) sorted binary
Not important for numeric columns (Top-N).
.....
how could you do a top-n since you didn't know "how" it would be sorted - multiple columns, how would the sorted data be returned???
... just said that it was
often recommended by "experts" - including Oracle own magazine (and it was an
article, not just user supplied tip) - at that time as a way to sort results of
the subselect. BTW, hash joins has n...
and they were mistaken and I would have shouted as loud as possible at them to "stop it". I cannot find it (google destroyed the newsgroup archives when they bought dejanews - searching back in time is useless) but I remember quite a heated discussion with a guy named 'keystroke' in the mid 1990's. the gist was:
poster asks: how to create a view with an order by
tom said: you cannot (it was not a capability at that time)
keystroke says: when people say you cannot it means they do not know how, you use group by and group by will sort the data
tom said: bzzzztttt - totally wrong, here are a couple of examples in version 7.0 showing group by does not sort. Simply adding an index for example would change the sort order. simply changing your nls_settings would change the sort order (give the wrong sort order and so on)
because it was in an Oracle Magazine article (which are mostly written by other people, especially back then, not Oracle itself - and reviewed by copy editors - not tech editors necessarily (that is why I got involved in the first place))
... BTW, hash joins has not existed at that time, they were added
later (if I remember properly). ...
yes, that was a version 7.3 addition - the hash join was.
but even before hashing - there were indexes, unique indexes in particular and they would return data sorted in the order of the indexed columns which might not be the same as the projected columns
DISTINCT in Oracle 8.1.6 documentation
Oleksandr Alesinskyy, September 01, 2009 - 7:52 pm UTC
See "Oracle8i Designing and Tuning for Performance
Release 2 (8.1.6)" available at
http://ugrafmsh.ru/japan/Our_Resources/Doc_Oracle8i/server.816/a76992/sql.htm There is the sentence
"Minimize the use of DISTINCT. DISTINCT always creates a sort; all the data must be instantiated before your results can be returned.".
What is even more funny - it is located in the "Use Care When Embedding Data Value Lists in Applications" section,
so it is almost definitely a leftover from sloppy editing of the older version of this book.
Ant it is so - see "Oracle8 Tuning, Release 8.0" at download.oracle.com, namely,
http://download.oracle.com/docs/cd/A64702_01/doc/server.805/a58246.pdf .
And here it as well is leftover (for the same reasons) - but I'm unable to trace it any further.
So such a statement was in the documentation - BTW, for 8.1.6 it was false above any doubt.
QUESTION
marian, January 27, 2012 - 1:38 pm UTC
Hello,
Could you explain me why the first query is ok but the second one is wrong.
SQL> drop table t;
Table dropped.
SQL>
SQL> create table t as select rownum x,lpad(rownum,100,'0') y from dual connect by level <= 10000;
Table created.
SQL>
SQL> select distinct x || '-' || y from (select * from t) where x = 1 order by x;
1-0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001
SQL>
SQL> select distinct x || '-' || y from (select * from t where x = 1) order by x;
ERROR at line 1:
ORA-01791: not a SELECTed expression
January 31, 2012 - 5:01 pm UTC
ops$tkyte%ORA11GR2> create table t as select rownum x, lpad(rownum,100,'0') y from dual;
Table created.
ops$tkyte%ORA11GR2> select distinct x || '-' || y from (select * from t) where x = 1 order by x;
X||'-'||Y
----------------------------------------------------------------------------------------------------
1-00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
01
ops$tkyte%ORA11GR2> select distinct x || '-' || y from (select * from t where x = 1) order by x;
X||'-'||Y
----------------------------------------------------------------------------------------------------
1-00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
01
care to mention a version? I cannot reproduce