Thanks for the question, Sokrates.

Asked: February 23, 2009 - 3:31 am UTC

Last updated: September 08, 2011 - 5:04 pm UTC

Version: 10.2.0

Viewed 10K+ times! This question is

You Asked

I never use "not exists"
The reason is: everytime I wanted to use it, I noticed, logically the same query can be achieved by using an outer join and then adding a condition "and <not null column (of the outer joined table/view/subquery)> is null".
(this only holds for not "more trivial not exists" where "not exists" can even be replaced by minus )
And everytime I replaced a "not exists" clause with this technique, I noticed the execution plan was "better" (fewer physical/logical reads, improved execution time).
So, my question:
are there really cases where you need "not exists" ?

and Tom said...

You use the word "everytime" which is "very strong"

I disagree, here is my counter case, it proves that "everytime" is wrong in your context. If two queries result in the same exact plan, they'll do the same exact work.

ops$tkyte%ORA10GR2> set autotrace traceonly explain
ops$tkyte%ORA10GR2> select * from scott.emp where not exists (select null from scott.dept where dept.deptno = emp.deptno);

Execution Plan
----------------------------------------------------------
Plan hash value: 3496123964

------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 40 | 3 (0)| 00:00:01 |
| 1 | NESTED LOOPS ANTI | | 1 | 40 | 3 (0)| 00:00:01 |
| 2 | TABLE ACCESS FULL| EMP | 14 | 518 | 3 (0)| 00:00:01 |
|* 3 | INDEX UNIQUE SCAN| PK_DEPT | 4 | 12 | 0 (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("DEPT"."DEPTNO"="EMP"."DEPTNO")

ops$tkyte%ORA10GR2> select emp.* from scott.emp, scott.dept where emp.deptno = dept.deptno(+) and dept.deptno is null;

Execution Plan
----------------------------------------------------------
Plan hash value: 3496123964

------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 40 | 3 (0)| 00:00:01 |
| 1 | NESTED LOOPS ANTI | | 1 | 40 | 3 (0)| 00:00:01 |
| 2 | TABLE ACCESS FULL| EMP | 14 | 518 | 3 (0)| 00:00:01 |
|* 3 | INDEX UNIQUE SCAN| PK_DEPT | 4 | 12 | 0 (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("EMP"."DEPTNO"="DEPT"."DEPTNO")

ops$tkyte%ORA10GR2> set autotrace off

You'd need to give me a concrete case to work with - the optimizer knows that not exists is an anti-join and will do the right thing.

The optimizer knows they are equivalent.

Me, I use NOT EXISTS over this fake outer join trick as it semantically says what I mean, whereas the outer join trick just makes people scratch their heads later trying to figure out what it is.

Let's discuss - but please people - don't just say "one time I replaced a not exists with outer join and it was better" - because all I'll say is "one time I replaced an outer join with a not exists and it was better" - demonstrate the context you were working under.

Rating

(24 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

don't just say "one time I replaced a not exists with outer join and it was better"

Duke Ganote, February 23, 2009 - 12:50 pm UTC

OK, I won't. I admit use of the poor man's anti-join so I don't have to remember the distinction between NOT IN and NOT EXISTS and whether to add a NOT NULL predicate to NOT IN. It's all so kNOTty.

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:442029737684

it depends

Sokrates, February 23, 2009 - 1:16 pm UTC

q<"
...
don't just say "one time I replaced a not exists with outer join and it was better" - because all I'll say is "one time I replaced an outer join with a not exists and it was better"
...
">

ok, it all depends, you are right.
I had an issue yesterday with a quite complex view and "not exists" (too complex to post here, nobody will be interested).

replacing "not exists" into an outer join - as I was used to - resulted in a completely other execution plan and brought down execution time from 12 minutes to 25 seconds.

Now I played around, trying to make a simple example - didn't succeed.
However, I saw (it seems to me that):
when query is a bit more complex than your example, execution plan will most likely change between "not exists" and "outer join", and sometimes the one seems to be superior, sometimes the other.

Learnt that I will take "not exists" into account more often, because it is absolutely more easy to read and to understand

February 23, 2009 - 5:46 pm UTC

... and sometimes the one
seems to be superior, sometimes the other.
...

and it will always come down to "one let us guess the estimated card= value better - get the right card= and we get the right plan, get the wrong one - and we don't"

So, that is one reason why we might play with dynamic sampling (level 3 - to fill in the blanks for guessing) or sql profiles before sitting down and rewriting SQL (or extended statistics, statistics on functions/virtual columns, statistics on column pairs and such available in 11g)

OBJECTION to "outer join trick just makes people scratch their heads"

Duke Ganote, February 24, 2009 - 10:48 am UTC

The outer join approach isn't a "trick" at all. I visualize a Venn diagram where:
A - ( A ^ B )
or:
"Set A minus the intersect of A and B"

http://en.wikipedia.org/wiki/Venn_Diagram#A_simple_example

In SQL:

select emp.*
from scott.emp A LEFT OUTER JOIN
scott.dept B -- all of set A while identifying ( A ^ B )
ON ( A.deptno = B.deptno )
WHERE B.deptno is null -- MINUS ( A ^ B )

The "NOT EXISTS" approach is logically equivalent.

SELECT A.*
FROM scott.emp A -- all of set A
WHERE NOT EXISTS -- MINUS ( A ^ B )
(SELECT B.deptno
FROM scott.dept B
where B.deptno = A.deptno)

The "NOT IN" approach is simply "A - B".

SELECT A.*
FROM scott.emp A
WHERE A.deptno NOT IN
(SELECT B.deptno
FROM scott.dept B
where B.deptno IS NOT NULL)

February 24, 2009 - 5:05 pm UTC

I'd guess that 90% of 'sql coders' aren't sure at all what an outer join really is actually - many people stick them in "just in case".

select emp.*
from scott.emp A LEFT OUTER JOIN
scott.dept B -- all of set A while identifying ( A ^ B )
ON ( A.deptno = B.deptno )
WHERE B.deptno is null -- MINUS ( A ^ B )

that to me, reading it, says "join emp to dept"

this:

SELECT A.*
FROM scott.emp A -- all of set A
WHERE NOT EXISTS -- MINUS ( A ^ B )
(SELECT B.deptno
FROM scott.dept B
where B.deptno = A.deptno)

says "read emp and give me everything from emp such that there is no match in dept"

The not exists says exactly what you are trying say - what the question is, the outer join - it does not. My mind would not say "in order to get things in emp that are not in dept, I start by joining emp to dept" - it doesn't work that way.

beyond the 90%

Duke Ganote, February 24, 2009 - 5:49 pm UTC

It's "more obvious" to me, because the intersect ( A ^ B ) is an inner join:
select emp.*
from scott.emp A INNER JOIN
scott.dept B -- ( A ^ B )
ON ( A.deptno = B.deptno )

Getting all of A means using the left outer join, then "throwing away" the intersect.

But I've never denied that I was born with a high quota of quirks!

February 24, 2009 - 9:38 pm UTC

well, to me, to outer join (join A to B, even if there isn't a match in B) in order to get all of the A's that are not in B is... counter-intuitive :)

You say "potato", I saw "potahto"

Duke Ganote, February 25, 2009 - 9:00 am UTC

Maybe it's a "regional dialect" of SQL usage? :)

SELECT emp.*
FROM scott.emp A -- all of A
LEFT OUTER JOIN scott.dept B ON ( A.deptno = B.deptno ) -- mark ( A ^ B )
WHERE B.deptno is null -- remove ( A ^ B )

Gershwin had it right...
http://en.wikipedia.org/wiki/Let's_Call_the_Whole_Thing_Off

February 25, 2009 - 5:59 pm UTC

well, technically joins are not intersects.

The outer join approach isn't a "trick" at all.  I visualize a Venn diagram where:
  A - ( A ^ B )
or: 
  "Set A minus the intersect of A and B"

that would be

select * from a
minus
(select * from a
intersect
select * from b
)

but that results in a different answer from the join :)

while a join is a set operation - technically joins of any sort are not intersect or minus 'capable'.

:)

Chuck Jolley, February 25, 2009 - 1:36 pm UTC

How about writing the query with the not exists so it is easy to read and understand, then if the optimizer cant find the optimal plan for a particular query then give it a hint on the best plan by using, well, hints?

an example

Sokrates, February 26, 2009 - 4:01 am UTC

here is an example.

I find it interesting that optimizer uses different plans (no matter which optimizer_mode I use), why ?

autotrace shows that both plans are equally expensive.
(
real world example was similar, but EMP2 was not a table but a complex view using analytics, group by and joining to some lookup tables.
there, outer join happened to be much cheaper than not exists.
)

REM grant execute on dbms_random to scott;


connect scott

delete emp;

insert into emp(empno, ename, deptno, hiredate)
select
rownum,
nls_initcap(dbms_random.string('u', trunc(dbms_random.value * 6) + 5)),
1 + trunc(99 * dbms_random.value),
trunc(sysdate - 10000 * (dbms_random.value * dbms_random.value))
from dual
connect by level < 10000;

CREATE TABLE EMP2
   (    EMPNO NUMBER(4,0),
        ENAME VARCHAR2(10),
        JOB VARCHAR2(9),
        MGR NUMBER(4,0),
        HIREDATE DATE,
        SAL NUMBER(7,2),
        COMM NUMBER(7,2),
        DEPTNO NUMBER(2,0),
         CONSTRAINT PK_EMP2 PRIMARY KEY (EMPNO)
  USING INDEX ,
         CONSTRAINT FK_DEPTNO2 FOREIGN KEY (DEPTNO)
          REFERENCES DEPT (DEPTNO)
);


insert into emp2(empno, ename, deptno, hiredate)
select
rownum,
nls_initcap(dbms_random.string('u', trunc(dbms_random.value * 6) + 5)),
1 + trunc(99 * dbms_random.value * dbms_random.value * dbms_random.value),
trunc(sysdate - 10000 * (dbms_random.value * dbms_random.value))
from dual
connect by level < 10000;


commit;

exec dbms_stats.gather_table_stats(user, 'EMP', cascade=>true)
exec dbms_stats.gather_table_stats(user, 'EMP2', cascade=>true)

I want to get each emp out of EMP for which there is no emp2 in EMP2 in the same department as emp
which was hired before the first hired emp in this department.

set lines 200 autotr traceonly feedb on

select e.*
  from scott.emp e
  where not exists
  (select null
  from scott.emp2 e2
  where e2.deptno = e.deptno
  and e2.hiredate < e.hiredate
  )
/

2048 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3714357713

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |  9560 |   532K|    10  (10)| 00:00:01 |
|*  1 |  HASH JOIN RIGHT ANTI|      |  9560 |   532K|    10  (10)| 00:00:01 |
|   2 |   TABLE ACCESS FULL  | EMP2 |  9999 |   107K|     7   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL  | EMP  |   105 |  2415 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("E2"."DEPTNO"="E"."DEPTNO")
       filter("E2"."HIREDATE"<"E"."HIREDATE")


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        251  consistent gets
          0  physical reads
          0  redo size
      70699  bytes sent via SQL*Net to client
       1834  bytes received via SQL*Net from client
        138  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       2048  rows processed


select e.*
from scott.emp e, (select deptno, min(hiredate) minhdate from scott.emp2 group by deptno) e2
where e2.deptno(+)=e.deptno
and e2.minhdate(+)<e.hiredate
and e2.deptno is null
/


2048 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 2263371308

-------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |  9999 |   566K|    11  (19)| 00:00:01 |
|*  1 |  FILTER                |      |       |       |            |          |
|*  2 |   HASH JOIN RIGHT OUTER|      |  9999 |   566K|    11  (19)| 00:00:01 |
|   3 |    VIEW                |      |    87 |  1044 |     9  (23)| 00:00:01 |
|   4 |     HASH GROUP BY      |      |    87 |   957 |     9  (23)| 00:00:01 |
|   5 |      TABLE ACCESS FULL | EMP2 |  9999 |   107K|     7   (0)| 00:00:01 |
|   6 |    TABLE ACCESS FULL   | EMP  |   100 |  2300 |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("E2"."DEPTNO" IS NULL)
   2 - access("E2"."DEPTNO"(+)="E"."DEPTNO")
       filter("E2"."MINHDATE"(+)<"E"."HIREDATE")


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        251  consistent gets
          0  physical reads
          0  redo size
      70699  bytes sent via SQL*Net to client
       1834  bytes received via SQL*Net from client
        138  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       2048  rows processed

March 03, 2009 - 7:26 am UTC

select e.*
from scott.emp e
where not exists
(select null
from scott.emp2 e2
where e2.deptno = e.deptno
and e2.hiredate < e.hiredate
)
/

select e.*
from scott.emp e, (select deptno, min(hiredate) minhdate from scott.emp2 group
by deptno) e2
where e2.deptno(+)=e.deptno
and e2.minhdate(+)<e.hiredate
and e2.deptno is null
/

do you see what you did?

You did a bit of manual optimization - something the optimizer didn't (yet) see.

You know that if

...
where not exists
(select null
from scott.emp2 e2
where e2.deptno = e.deptno
and e2.hiredate < e.hiredate
)
....

is true, then we could just generate a (probably MUCH SMALLER) set of deptno/hiredate values - using group by.

You did it in ONE but not the other.

Hence you did not test "not exists" vs "outer join" but rather "compare to entire set, versus "group by the set down to something very small and then compare"

In short, you did not test the premise you put forth in this case.

(and analytics would have been a better way to phrase this, you wanted the set of employees such that they had the minimum hiredate - not in and anti joins, neither of them "say" that to me)

Considerations for not exists

Ryan, February 26, 2009 - 5:21 pm UTC

I agree with using not exists.

Consider a case where you want to perform more logic than just "One not in the other".

Consider Item, Sale, and SaleDetail:

create table Item
(
ItemId number(6),
ItemDescription varchar(200),
UnitPrice number(18,6)
)
/
Create table sale
(
SaleId number(6),
TransactionDate date
)
/
create table SaleDetail
(
SaleId number(6),
ItemId number(6),
Quantity number(18)

)

insert into item
values (1,'Toyota Prius',18500.95)
/
insert into item
values (2,'Lock Washer', .004)
/

insert into sale
values (1000,'15-Jun-2008')
/
insert into sale
values (1001,'23-Aug-2008')
/
insert into sale
values (1002,'05-Jan-2009')
/
insert into sale
values (1003,'09-Jan-2009')
/
insert into sale
values (1004,'15-Feb-2009')
/

insert into SaleDetail
values (1000, 1, 1)
/
insert into SaleDetail
values (1001, 1, 2)
/
insert into SaleDetail
values (1002, 1, 2)
/

insert into SaleDetail
values (1003, 2, 1000)
/
insert into SaleDetail
values (1004, 2, 1900)
/

Now, I want to know which items from the item table have yet to be sold in 2009 in a quantity greater than 100 (what have i not yet sold in bulk this year?).

Two alternatives using outer join logic (both nasty to read and comprehend):

select i.*
from item i
left outer join saleDetail sd
ON i.itemId = sd.ItemId
AND sd.quantity > 100
Left outer join sale s
ON sd.saleId = s.saleid
and trunc(s.transactionDate,'YYYY') = to_date('01-JAN-2009','DD-MON-YYYY')
WHERE sd.itemId is null

or

select i.*
from item i
left outer join
(select sd_i.*
from saleDetail sd_i
JOIN sale s
on s.saleid = sd_i.saleid
where trunc(s.transactionDate,'YYYY') = to_date('01-JAN-2009','DD-MON-YYYY')) sd

ON i.itemId = sd.ItemId
AND sd.quantity > 100
WHERE sd.itemId is null

Now again, using the not exists (much easier to read and much easier to test what HAS been sold in bulk in 2009 by removing the inner to outer join on itemId):

select *
from item i
where not exists (select 'x'
from saleDetail sd
JOIN sale s
ON sd.saleId = s.saleId
where sd.itemId =i.itemId
and quantity > 100
and trunc(s.transactionDate,'YYYY') =
to_date('01-JAN-2009','DD-MON-YYYY')
)

Many developers would even try moving the where-like and conditions in the outer join syntax into the body of the where clause, thinking that it was just misplaced initially. Once it's in the where, you've created an inner join dataset and no inner join would have the "non-inner-join anti join" condition.

I'm pro anti-join

Duke Ganote, February 27, 2009 - 12:28 pm UTC

After standardizing the subquery in Ryan's example, I now realize a couple of reasons I prefer the LEFT JOIN version:
1) ANSI-standard join syntax
2) reinforcement of the idea that WHERE filters are applied after the join.

WITH 
saleSET AS
(SELECT sd.itemID
   FROM saleDetail sd
        JOIN sale s
          ON sd.saleId = s.saleId
    AND sd.quantity > 100
    AND s.transactionDate BETWEEN TO_DATE('20090101','YYYYMMDD')
                              AND TO_DATE('20091231','YYYYMMDD')
)
SELECT count(*)
  FROM item i
       LEFT JOIN saleSET sd
       ON ( sd.itemId =i.itemId )
 WHERE sd.itemID IS NULL

In the NOT EXISTS version I have to go searching for the join condition in the WHERE clause and find the correlated reference.

I don't like commingled join and filter conditions, which are required for a correlated subquery.

WITH
saleSET AS
(SELECT sd.itemID
   FROM saleDetail sd
        JOIN sale s
          ON sd.saleId = s.saleId
    AND sd.quantity > 100
    AND s.transactionDate BETWEEN TO_DATE('20090101','YYYYMMDD')
                              AND TO_DATE('20091231','YYYYMMDD')
)
SELECT count(*)
  FROM item i
       WHERE NOT EXISTS ( SELECT itemID
                            FROM saleSET sd
                           WHERE sd.itemId =i.itemId )

http://en.wikipedia.org/wiki/Relational_algebra#Antijoin

March 03, 2009 - 9:51 am UTC

... 2) reinforcement of the idea that WHERE filters are applied after the join. ...

that is only in concept. It doesn't have to be that way

"doesn't have to be that way..."

Duke Ganote, March 03, 2009 - 10:16 am UTC

I believe you're saying there can be (and are) internal optimizations, some of which may apply filters before joining. However, the results must be equivalent to logically/conceptually applying filters (WHERE clauses) after join conditions.
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:6585774577187#56266225203647

March 03, 2009 - 5:01 pm UTC

yup, so - to say the where happens after the join - isn't really technically accurate, conceptually you can think of it that way.

I say this, because lots of people think:

select *
from (select * from generic_table where datatype = 'NUMBER' )
where to_number( value ) > 42;

would "force" the where datatype = 'NUMBER' to be evaluated first, so that the to_number(value) would be safe (generic model, store number, dates and strings in value, use datatype to figure out what it really is at runtime).

But, it doesn't.

Chuck, March 03, 2009 - 4:13 pm UTC

Possibly this demonstrates the stoneage dating of my database theory classes, but isn't one of the primary rules of SQL the proposition that there is no order of operation?
In other words, a SQL query is a single statement.
This implies that all access orders for a query must produce equivalent results.
Otherwise SQL would not be a logically consistent language.

March 03, 2009 - 9:21 pm UTC

SQL is a non-procedural language, yes.

It is like a complex definition of a program - that is then machine coded. Think of SQL like a specification - you are saying what needs be done, not how to do it so much.

Order of evaluation

Duke Ganote, March 04, 2009 - 1:15 am UTC

There must be an order of evaluation within an SQL statement. For example "Analytic functions are the last set of operations performed in a query except for the final ORDER BY clause. All joins and all WHERE, GROUP BY, and HAVING clauses are completed before the analytic functions are processed. Therefore, analytic functions can appear only in the select list or ORDER BY clause."

http://download-west.oracle.com/docs/html/A85397_01/function.htm#81409

March 04, 2009 - 1:05 pm UTC

Certain things sometimes impose ordering, but sql is non-procedural. We do view merging, query rewrites, predicate pushing - all kinds of things.

Don't think a parenthesis means "this will happen first"

Specifically, the example I gave:

select *
from (select * from generic_table where datatype = 'NUMBER' )
where to_number( value ) > 42;

you do not know if datatype='number' or to_number(value) will happen FIRST.

Specifically, the example YOU gave - you cannot say the where will happen after the join, it doesn't have to.

RE: "you cannot say the where will happen after the join"

Duke Ganote, March 04, 2009 - 5:06 pm UTC

You wrote: "the example YOU gave - you cannot say the where will happen after the join, it doesn't have to."

Sorry, I must not grasp which example you're referencing.

Surely you can't mean a LEFT JOIN example (like the one below) when you write that.

 
  select DISTINCT e.deptno
    From emp e left join dept d
         ON ( e.deptno = d.deptno )
   where d.deptno is null
  /
  no rows selected

It's a vastly different answer if the WHERE filter is applied first by making it an ON filter:

  select DISTINCT e.deptno
    From emp e left join dept d
         ON ( e.deptno = d.deptno
          AND d.deptno is null )
  /
  DEPTNO
  ------
      10
      20
      30

March 04, 2009 - 8:34 pm UTC

they can happen simultaneously in a way, think about

select DISTINCT e.deptno
From emp e left join dept d
ON ( e.deptno = d.deptno )
where d.deptno is null or d.deptno = 10
/

could it apply a bit of the where clause first? Sure it could. (by your example, I meant - "it shows the outer join is done before the where).

Chuck, March 04, 2009 - 9:47 pm UTC

For once I think Tom read your question too quickly.
Your last two examples aren't logically the same query.

The point is substituting and outer join with an outer not null for a "not exists"

March 05, 2009 - 1:23 pm UTC

I'm not sure what you mean - whose question?

RE: "can happen simultaneously in a way"

Duke Ganote, March 05, 2009 - 2:10 pm UTC

I think you're making two observations:

1) There are useless outer joins that the optimizer *could* avoid, e.g.
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:4963137609733
where you wrote:

"If you see a query like this:

from t1, t2
where t1.x = t2.x(+)
and t2.any_other_column <some condition>

"you know for a fact that outer join is a waste, a performance drain, something to be removed from the query."

I think this example actually supports my idea of using the ANSI OUTER JOIN approach, in part as a conceptual reinforcement that *logically* WHERE occurs after JOIN.

2) unexpected ORA-01722. The exception that "proves (tests) the rule".
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:45012348053

You note that the optimizer's chosen path between starting point "A" and requested endpoint "B" may go straight through a minefield... even though the "as written" query said "A -> C -> B" in order to avoid the minefield. (And that's where one could use ROWNUM or MATERIALIZE or the like to tell the optimizer "no *really*, A -> C -> B".)

But I think that issue of "suboptimal (fatal) optimization" could arise even with the NOT EXISTS form of the question, and is not peculiar to the OUTER JOIN form.

March 05, 2009 - 3:02 pm UTC

1) Oh, don't get me wrong, if I wanted to do an outer join, the ansi syntax does have definite semantic advantages (notice how I slipped in there "if i wanted to do a join :)")

2) agreed

It's not a join...

Duke Ganote, March 12, 2009 - 9:37 pm UTC

There's just no "anti join" syntax, say, like this!

SELECT count(*)
FROM item i
ANTI JOIN saleSET sd
ON ( sd.itemId =i.itemId )
;

March 13, 2009 - 7:25 am UTC

sure there is....

where not in
where not exists

:)

You don't want an "anti join" (not any more than you 'want' a sort merge or a nested loop or a hash join, you want a JOIN, the mechanics - that is up to the optimizer.

You don't want a join at all, you want a set of data such that it is NOT IN some other set... To me, using a join to perform that, is semantically "awkward"

but I do!

Duke Ganote, March 13, 2009 - 12:11 pm UTC

I'm thinking Relational Algebra where anti-join ranks with natural-join, equi-join, and outer-join.
http://en.wikipedia.org/wiki/Relational_algebra#Antijoin

Dan Tow's put his finger on the "wrong error" (e.g. unexpected ORA-01722):
http://www.onlamp.com/pub/a/onlamp/2004/09/02/wrongerrorbugs.html

March 13, 2009 - 2:58 pm UTC

I do not believe "anti-join" was one of the original "we are joins". It is an optimizer technique, it ISN"T something we code. There is no syntax for an 'anti-join' really, there isn't a "anti-join" like the others (natural join, outer join, etc)

We code "give me this", the optimizer says "I'll do a nested loops join, I'll do a sort merge, I'll do a hash, I'll do an anti-join".

There isn't anything like an "anti-join" in the language. There is "give me everything in this SET minus that SET (not an anti-join, a minus)", "give me everything in set SET not in that other SET (not an anti-join, a not in - SOMETIMES equivalent to not exists)" and so on.

I will always code "give me everything in this NOT IN that" as a not in/not exists, to me - it just makes much more semantic sense than "let us outer join AND THEN throw out the rows that match", that is all...

In the fullness of....

Duke Ganote, March 13, 2009 - 10:23 pm UTC

There should be a full anti join!

SELECT count(*)
FROM item i
FULL ANTI JOIN saleSET sd
ON ( sd.itemId =i.itemId )

as the equivalent of...

SELECT count(*)
FROM item i
FULL OUTER JOIN saleSET sd
ON ( sd.itemId =i.itemId )
WHERE sd.itemID IS NULL
OR i.itemID IS NULL

:)

March 16, 2009 - 8:59 am UTC

there is!

select count(*) from item where not exists! :)

Anti-Join

Manuel Vidigal, March 31, 2009 - 1:26 pm UTC

Hi Tom,

I'm having a different explain plan in Dev and Prod (both are version 10.2.0) for this query:

SELECT *
FROM swiadm.arruamento_tmp t
WHERE NOT EXISTS(SELECT NULL
FROM swiadm.arruamento a
WHERE a.ID = t.ID);

Dev:
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 120 | | 3492 (1)| 00:00:42 |
|* 1 | HASH JOIN RIGHT ANTI | | 1 | 120 | 7008K| 3492 (1)| 00:00:42 |
| 2 | INDEX FAST FULL SCAN| PK_ARRUAMENTO | 422K| 2060K| | 241 (2)| 00:00:03 |
| 3 | TABLE ACCESS FULL | ARRUAMENTO_TMP | 282K| 30M| | 1210 (2)| 00:00:15 |
------------------------------------------------------------------------------------------------

Prod:
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 282K| 21M| 543K (1)| 01:48:47 |
|* 1 | FILTER | | | | | |
| 2 | TABLE ACCESS FULL| ARRUAMENTO_TMP | 282K| 21M| 1193 (2)| 00:00:15 |
|* 3 | INDEX UNIQUE SCAN| PK_ARRUAMENTO | 1 | 4 | 2 (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Do you have any idea of why the optimizer in prod doesn't use the anti-join?

Thanks in advance.

April 01, 2009 - 7:31 am UTC

what say the statistics? Query them up for that arruamento table and compare them.

Also, what parameters are different?

select name, value from v$parameter where isdefault = 'FALSE';

compare them and post the differences.

RE: Anti-Join

Manuel Vidigal, April 01, 2009 - 10:59 am UTC

Hi Tom,

I'm not able to get the statistics in prod. It's an endless query.

Hi found out that if I use the UNNEST hint, I can force the optimizer to use the anti-join and the query ends in less than 1 minute:

SELECT *
  FROM swiadm.arruamento_tmp t
 WHERE NOT EXISTS(SELECT /*+ UNNEST */NULL
                    FROM swiadm.arruamento a
                   WHERE a.ID = t.ID);

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=3363 Card=1 Bytes=123)
   1    0   HASH JOIN (RIGHT ANTI) (Cost=3363 Card=1 Bytes=123)
   2    1     INDEX (FAST FULL SCAN) OF 'PK_ARRUAMENTO' (INDEX (UNIQUE))
       (Cost=184 Card=306128 Bytes=1530640)
   3    1     TABLE ACCESS (FULL) OF 'ARRUAMENTO_TMP' (TABLE) (Cost=11
          91 Card=282144 Bytes=33292992)


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       6903  consistent gets
       6179  physical reads
          0  redo size
    5062279  bytes sent via SQL*Net to client
       7489  bytes received via SQL*Net from client
        644  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
      64300  rows processed

I also checked the Parameter and this was what I found:

Different Parameters

NAME                                  PROD VALUE  DEV VALUE
filesystemio_options                  DIRECTIO    none  (is default)
java_max_sessionspace_size            120000000   128000000
java_pool_size                        134217728   130023424
large_pool_size                       134217728   96468992
open_cursors                          1000        600
optimizer_secure_view_merging         FALSE       TRUE (is default)
pga_aggregate_target                  838860800   512000000
processes                             500         200
shared_pool_size                      939524096   130023424
streams_pool_size                     100663296   0 (is default)
_optimizer_cost_based_transformation  OFF         Doesn't exist in v$parameter

Does something pop up in your mind?

April 01, 2009 - 4:20 pm UTC

yes you can get the statistics - if you can access the table, you can get the statistics out of the dictionary??

yeah, that underscore parameter. bingo.

Manuel Vidigal, April 01, 2009 - 11:17 am UTC

Hi think I've found the awnser.

Using ALTER SESSION SET "_optimizer_cost_based_transformation" = ON ;

The explain plan started using the anti-join.

Do you know of any reason why they have this parameter set to off?

Thanks,
Manuel

April 01, 2009 - 4:26 pm UTC

you should really consult the person that set it on - to find out why they did so.

Thanks Tom

A reader, April 01, 2009 - 6:15 pm UTC

Thank you for your time.

Not Exists and a join

Maria Antony Samy Issac, August 01, 2011 - 8:21 am UTC

Hi Tom,

I got below given tables.

TABLE_A, having over 1 million records - got an index on the column TABLE_A.ID.
TABLE_B, having over 1 million records - got an index on the column TABLE_B.ID.
TABLE_C, having about 2000 records - got an index on the column TABLE_C.ID.

TABLE_B contains the IDs that are valid one. So, to validate the records of TABLE_A,
we use below given join.

SELECT A.ID
FROM TABLE_A A,
TABLE_B B
WHERE
AND A.ID = B.ID

TABLE_C contains the IDs that are not valid. So, the above query to validate the records of TABLE_A can be written as

SELECT A.ID
FROM TABLE_A A
WHERE NOT EXISTS ( SELECT 1 FROM TABLE_C C
WHERE A.ID = C.ID
)

I said to my friend that the second approach is better over the first. But he thinks the other way. Please suggest.

Note: The table TABLE_A and TABLE_B will keep growing as we get more customers. TABLE_C will also keep growing, but in the same ratio as it is now ( i.e about 2000 for every million records in TABLE_A)

August 01, 2011 - 12:03 pm UTC

depends on the plans.

I'd lean towards:

select * from table_a a where a.id not in (select c.id from table_c c where c.id is not null);

and I'd be looking for two full scans and a hash anti-join.

Not Exists

A reader, September 07, 2011 - 2:03 pm UTC

hi Tom,

I have a situation where I have to join multiple tables, say 6 tables which are around 1GB in size with 1 million of records.

Now I have to join three tables in normal way with their keys. With one table I have to check the sum of a column based on ID > 0 in that. And I have two big tables with which I have to check not exists.

In this example table_1 can be considered as driving table with col_1 being a primary key.

SELECT t1.col1, t2.col2, t3.col_2
from table_1 t1, table_2 t2, table_3 t3
where t1.col_1 = t2.col2
and t2.col_3=t3.col_1
and (select sum(abs(t4.col_2))from table_4 t4 where t4.col_1=t1.col_1) > 0
and not exists(select null from table_5 t5 where t5.col_1=t1.col_1 and t5.col_2 not in (100,200))
and not exists(select null from table_6 t6 where t6.col_1=t1.col_1);

Now my not NOT EXIST check takes a fare bit of time. May be because I also join with such a table where I do SUM operation.

Is there something which works faster than NOT EXISTS? How can I reframe the query to be a good faster one? Any use of analytics may help?

Apologies for not providing the script as these are huge tables.

thanks for your time.

September 08, 2011 - 5:04 pm UTC

I would not be using scalar subqueries at all.

with t4
as
(select sum(abs(t4.col_2)) sum_col_2, col_1
from table_4 t4
group by col_1 ),
t5
as
(select col_1
from table_5
where col_2 not in (100,200)
and col_1 is NOT NULL),
t6
as
(select col_1
from table_6
where col_1 IS NOT NULL)
SELECT t1.col1, t2.col2, t3.col_2
from table_1 t1, table_2 t2, table_3 t3, t4
where t1.col_1 = t2.col2
and t2.col_3=t3.col_1
and t4.col_1 = t1.col_1 and t4.sum_col_2 > 0
and t1.col_1 NOT IN (select * from t5)
and t1.col_1 NOT IN (select * from t6)
/

You'd be looking to join to t4 after aggregating it up, you'd be looking for ANTI-JOINS to t5 and t6 (if you don't see anti-join in the plan, use:

with t4
as
(select sum(abs(t4.col_2)) sum_col_2, col_1
from table_4 t4
group by col_1 ),
t5
as
(select /* add distinct if needed */ distint col_1
from table_5
where col_2 not in (100,200)
and col_1 is NOT NULL),
t6
as
(select /* add distinct if needed */ distinct col_1
from table_6
where col_1 IS NOT NULL)
SELECT t1.col1, t2.col2, t3.col_2
from table_1 t1, table_2 t2, table_3 t3, t4, t5, t6
where t1.col_1 = t2.col2
and t2.col_3=t3.col_1
and t4.col_1 = t1.col_1 and t4.sum_col_2 > 0
and t5.col_1(+) = t1.col_1 and t5.col_1 is null
and t6.col_1 = t1.col_1 and t6.col_1 is null
/

I hope I got the syntax right - and the semantics right - but without a TEST CASE it is really really hard to write sql you know....

More to Explore

Performance

Get all the information about database performance in the Database Performance guide.

Breadcrumb

Question and Answer

You Asked

and Tom said...

Rating

Comments

don't just say "one time I replaced a not exists with outer join and it was better"

it depends

OBJECTION to "outer join trick just makes people scratch their heads"

beyond the 90%

You say "potato", I saw "potahto"

an example

Considerations for not exists

I'm pro anti-join

"doesn't have to be that way..."

Order of evaluation

RE: "you cannot say the where will happen after the join"

RE: "can happen simultaneously in a way"

It's not a join...

but I do!

In the fullness of....

Anti-Join

RE: Anti-Join

Thanks Tom

Not Exists and a join

Not Exists

More to Explore

Performance

"not exists" does not exist (for me)

Breadcrumb

Question and Answer

You Asked

and Tom said...

Rating

Comments

don't just say "one time I replaced a not exists with outer join and it was better"

it depends

OBJECTION to "outer join trick just makes people scratch their heads"

beyond the 90%

You say "potato", I saw "potahto"

an example

Considerations for not exists

I'm pro anti-join

"doesn't have to be that way..."

Order of evaluation

RE: "you cannot say the where will happen after the join"

RE: "can happen simultaneously in a way"

It's not a join...

but I do!

In the fullness of....

Anti-Join

RE: Anti-Join

Thanks Tom

Not Exists and a join

Not Exists

More to Explore

Performance