Here's what a quick-and-dirty test shows:
I've built four tables. Two with no duplicates; two with 50% duplicates. There are no duplicates between tables. i.e. each value only exists in one table.
Then compared the following three options:
- UNION the tables
- UNION ALL the tables, applying DISTINCT to the tables with duplicates
- UNION the tables, applying DISTINCT to the tables with duplicates
Which shows:
create table no_dups_1 (
c1 primary key
) as
select level from dual
connect by level <= 10;
create table no_dups_2 (
c1
) as
select level+10 from dual
connect by level <= 10;
create table half_dups_1 (
c1
) as
select mod ( level, 2 ) + 21
from dual
connect by level <= 10;
create table half_dups_2 (
c1
) as
select mod ( level, 2 ) + 23
from dual
connect by level <= 10;
set serveroutput off
alter session set statistics_level = all;
select * from no_dups_1
union
select * from no_dups_2
union
select * from half_dups_1
union
select * from half_dups_2;
-----------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
-----------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 24 |
| 1 | SORT UNIQUE | | 1 | 40 | 24 |
| 2 | UNION-ALL | | 1 | | 40 |
| 3 | INDEX FULL SCAN | SYS_C0046833 | 1 | 10 | 10 |
| 4 | TABLE ACCESS FULL| NO_DUPS_2 | 1 | 10 | 10 |
| 5 | TABLE ACCESS FULL| HALF_DUPS_1 | 1 | 10 | 10 |
| 6 | TABLE ACCESS FULL| HALF_DUPS_2 | 1 | 10 | 10 |
-----------------------------------------------------------------------
select *
from table(dbms_xplan.display_cursor(null, null, 'ROWSTATS LAST'));
select * from no_dups_1
union all
select * from no_dups_2
union all
select distinct c1 from half_dups_1
union all
select distinct c1 from half_dups_2;
-----------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
-----------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 24 |
| 1 | UNION-ALL | | 1 | | 24 |
| 2 | INDEX FULL SCAN | SYS_C0046833 | 1 | 10 | 10 |
| 3 | TABLE ACCESS FULL | NO_DUPS_2 | 1 | 10 | 10 |
| 4 | HASH UNIQUE | | 1 | 10 | 2 |
| 5 | TABLE ACCESS FULL| HALF_DUPS_1 | 1 | 10 | 10 |
| 6 | HASH UNIQUE | | 1 | 10 | 2 |
| 7 | TABLE ACCESS FULL| HALF_DUPS_2 | 1 | 10 | 10 |
-----------------------------------------------------------------------
select *
from table(dbms_xplan.display_cursor(null, null, 'ROWSTATS LAST'));
select * from no_dups_1
union
select * from no_dups_2
union
select distinct c1 from half_dups_1
union
select distinct c1 from half_dups_2;
select *
from table(dbms_xplan.display_cursor(null, null, 'ROWSTATS LAST'));
-----------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
-----------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 24 |
| 1 | SORT UNIQUE | | 1 | 40 | 24 |
| 2 | UNION-ALL | | 1 | | 40 |
| 3 | INDEX FULL SCAN | SYS_C0046833 | 1 | 10 | 10 |
| 4 | TABLE ACCESS FULL| NO_DUPS_2 | 1 | 10 | 10 |
| 5 | TABLE ACCESS FULL| HALF_DUPS_1 | 1 | 10 | 10 |
| 6 | TABLE ACCESS FULL| HALF_DUPS_2 | 1 | 10 | 10 |
-----------------------------------------------------------------------
So:
UNION & UNION + DISTINCT give the same plan. In both cases the database pushes the * UNIQUE to the top. Instead of applying DISTINCT at the table level.
So unless you get creative (e.g. put the DISTINCT checks in a non-mergable subquery), these come out "the same". There's no duplication of the DISTINCT operation.
If you know your tables contain distinct values (so you can UNION ALL their distinct results), pushing the DISTINCT to the table level
might come out better; I'd want to test on the real data set before deciding which method to use.