While I do understand that NOT IN is not the exact opposite of IN, Reference article here
https://www.red-gate.com/simple-talk/databases/oracle-databases/the-effects-of-null-with-not-in-on-oracle-transformations/ I had one question and hoping someone could clarify
If I had a table say SAMPLE_DATA with over one million rows. There is a column called status in this table. This is a non nullable column and hence no nulls allowed. This column has around 15 distinct values.
Now the question is, which is better?
IN (4,5,6,7,8,9,10,11,12,13,14,15)
or
NOT IN (1,2,3)
This is something which has actually caused performance issues while using NOT IN (1,2,3) and so IN (4,5,6,....) was written to performance-improve this. Is this really correct? There is an index on this column
There are a few things to consider here.
First up, even with mandatory columns, you have to be careful switching NOT IN to IN with the opposite values. The example assumes that the integers 4 - 15 covers all the possible values the column will
ever store. If someone inserts or updates a row to the value 16, the IN condition will exclude it from the results. Whereas NOT IN ( 1, 2, 3 ) will include it. This could lead to incorrect results.
Assuming you're avoiding this problem somehow, NOT IN is likely to be less efficient than IN. This is because it's harder for the optimizer to estimate how many rows it'll process and NOT IN limits access paths available.
In the example below, the table has 100 rows - 10 for each integer from 1 - 10.
When searching IN ( 1, 2, 3 ), the optimizer correctly estimates the query will return 30 rows and uses an index range scan.
But with NOT IN ( 4, 5, 6, 7, 8, 9, 10 ), it estimates 48 rows - roughly a 50% error - and uses a full table scan.
A major reason for using a full scan (in addition to the incorrect row estimate) is the database can't search the index for the values 1, 2, 3 like it can with IN. It doesn't know those are the only other values. But in most active systems we don't know for sure 4-10 are the only other values either!
So if you force the database to use an index as in the final query, it has to full scan the index. It's unsafe to use an index range scan.
create table t ( c1, c2 ) as
with rws as (
select level c1 from dual
connect by level <= 10
)
select r1.c1, sysdate from rws r1 cross join rws r2;
create index i
on t ( c1 );
exec dbms_stats.gather_table_stats ( user, 't' ) ;
set serveroutput off
select count ( c2 ) from t
where c1 in ( 1, 2, 3 );
select *
from dbms_xplan.display_cursor ( format => 'BASIC +ROWS LAST');
/*
--------------------------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1 |
| 2 | INLIST ITERATOR | | |
| 3 | TABLE ACCESS BY INDEX ROWID BATCHED| T | 30 |
| 4 | INDEX RANGE SCAN | I | 30 |
--------------------------------------------------------------
*/
select count ( c2 ) from t
where c1 not in ( 4, 5, 6, 7, 8, 9, 10 );
select *
from dbms_xplan.display_cursor ( format => 'BASIC +ROWS LAST');
/*
-------------------------------------------
| Id | Operation | Name | Rows |
-------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1 |
| 2 | TABLE ACCESS FULL| T | 48 |
-------------------------------------------
*/
select /*+ index ( t i ) */count ( c2 ) from t
where c1 not in ( 4, 5, 6, 7, 8, 9, 10 );
select *
from dbms_xplan.display_cursor ( format => 'BASIC +ROWS LAST');
/*
-------------------------------------------------------------
| Id | Operation | Name | Rows |
-------------------------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1 |
| 2 | TABLE ACCESS BY INDEX ROWID BATCHED| T | 48 |
| 3 | INDEX FULL SCAN | I | 48 |
-------------------------------------------------------------
*/
So yes, the optimizer is more likely to choose a faster plan using IN than the equivalent NOT IN. But in real systems, it's rarely safe to swap NOT IN for IN. You have to take great care to ensure the NOT IN list includes all the other values in the column.