You Asked

While I do understand that NOT IN is not the exact opposite of IN, Reference article here
https://www.red-gate.com/simple-talk/databases/oracle-databases/the-effects-of-null-with-not-in-on-oracle-transformations/

I had one question and hoping someone could clarify

If I had a table say SAMPLE_DATA with over one million rows. There is a column called status in this table. This is a non nullable column and hence no nulls allowed. This column has around 15 distinct values.

Now the question is, which is better?

IN (4,5,6,7,8,9,10,11,12,13,14,15)
or
NOT IN (1,2,3)

This is something which has actually caused performance issues while using NOT IN (1,2,3) and so IN (4,5,6,....) was written to performance-improve this. Is this really correct? There is an index on this column

and Chris said...

There are a few things to consider here.

First up, even with mandatory columns, you have to be careful switching NOT IN to IN with the opposite values. The example assumes that the integers 4 - 15 covers all the possible values the column will ever store. If someone inserts or updates a row to the value 16, the IN condition will exclude it from the results. Whereas NOT IN ( 1, 2, 3 ) will include it. This could lead to incorrect results.

Assuming you're avoiding this problem somehow, NOT IN is likely to be less efficient than IN. This is because it's harder for the optimizer to estimate how many rows it'll process and NOT IN limits access paths available.

In the example below, the table has 100 rows - 10 for each integer from 1 - 10.

When searching IN ( 1, 2, 3 ), the optimizer correctly estimates the query will return 30 rows and uses an index range scan.

But with NOT IN ( 4, 5, 6, 7, 8, 9, 10 ), it estimates 48 rows - roughly a 50% error - and uses a full table scan.

A major reason for using a full scan (in addition to the incorrect row estimate) is the database can't search the index for the values 1, 2, 3 like it can with IN. It doesn't know those are the only other values. But in most active systems we don't know for sure 4-10 are the only other values either!

So if you force the database to use an index as in the final query, it has to full scan the index. It's unsafe to use an index range scan.

create table t ( c1, c2 ) as 
with rws as (
  select level c1 from dual
  connect by level <= 10
)
  select r1.c1, sysdate from rws r1 cross join rws r2;
  
create index i 
  on t ( c1 );
exec dbms_stats.gather_table_stats ( user, 't' ) ;
  
set serveroutput off
select count ( c2 ) from t
where  c1 in ( 1, 2, 3 );

select * 
from   dbms_xplan.display_cursor ( format => 'BASIC +ROWS LAST');
/*
--------------------------------------------------------------
| Id  | Operation                             | Name | Rows  |
--------------------------------------------------------------
|   0 | SELECT STATEMENT                      |      |       |
|   1 |  SORT AGGREGATE                       |      |     1 |
|   2 |   INLIST ITERATOR                     |      |       |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T    |    30 |
|   4 |     INDEX RANGE SCAN                  | I    |    30 |
--------------------------------------------------------------
*/
select count ( c2 ) from t
where  c1 not in ( 4, 5, 6, 7, 8, 9, 10 );

select * 
from   dbms_xplan.display_cursor ( format => 'BASIC +ROWS LAST');
/*
-------------------------------------------
| Id  | Operation          | Name | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT   |      |       |
|   1 |  SORT AGGREGATE    |      |     1 |
|   2 |   TABLE ACCESS FULL| T    |    48 |
-------------------------------------------
*/
select /*+ index ( t i ) */count ( c2 ) from t
where  c1 not in ( 4, 5, 6, 7, 8, 9, 10 );

select * 
from   dbms_xplan.display_cursor ( format => 'BASIC +ROWS LAST');
/*
-------------------------------------------------------------
| Id  | Operation                            | Name | Rows  |
-------------------------------------------------------------
|   0 | SELECT STATEMENT                     |      |       |
|   1 |  SORT AGGREGATE                      |      |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T    |    48 |
|   3 |    INDEX FULL SCAN                   | I    |    48 |
-------------------------------------------------------------
*/

So yes, the optimizer is more likely to choose a faster plan using IN than the equivalent NOT IN. But in real systems, it's rarely safe to swap NOT IN for IN. You have to take great care to ensure the NOT IN list includes all the other values in the column.

Rating

(2 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Re: Which is better IN or NOT IN?

Narendra, July 13, 2022 - 4:31 pm UTC

To OP,

It might make sense to use function-based index/virtual column with index on it, in order to avoid having to use NOT IN.
For e.g. in case of the table having millions of rows, with 15 distinct values, 1-15, if the functionally correct query requires using "NOT IN (1,2,3)" predicate then a function-based index on an expression like "CASE WHEN col NOT IN (1,2,3) THEN 1 END" can help the query to be rewritten and still use index (provided the index is a better-performing choice)

July 14, 2022 - 1:48 am UTC

Agreed - but a lot of this depends the volume/spread of the values of interest versus those you want to discard

from the indexing master.

Rajeshwaran, Jeyabal, July 14, 2022 - 10:31 am UTC

The indexing master did a nice explanation about this (indexes and not-equal) from his blog @ https://richardfoote.wordpress.com/2008/08/13/indexes-and-not-equal-not-now-john/

July 14, 2022 - 4:57 pm UTC

Yep, thanks for adding

More to Explore

SQL

The Oracle documentation contains a complete SQL reference.

Breadcrumb

Question and Answer