You Asked

Hello Tom,

my test case
create table xxx as select * from dba_tables;
insert into xxx select * from dba_tables;

I tried 2 queries
1* select distinct * from xxx
2359 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 4052720149
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2358 | 580K| 28 (4)| 00:00:01 |
| 1 | HASH UNIQUE | | 2358 | 580K| 28 (4)| 00:00:01 |
| 2 | TABLE ACCESS FULL| XXX | 2358 | 580K| 27 (0)| 00:00:01 |
---------------------------------------------------------------------------
and
TEST@pdborcl>select * from xxx
2 union
3 select * from xxx where 1=2;
2359 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 2153587207
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2359 | 580K| 29 (7)| 00:00:01 |
| 1 | SORT UNIQUE | | 2359 | 580K| 29 (7)| 00:00:01 |
| 2 | UNION-ALL | | | | | |
| 3 | TABLE ACCESS FULL | XXX | 2358 | 580K| 27 (0)| 00:00:01 |
|* 4 | FILTER | | | | | |
| 5 | TABLE ACCESS FULL| XXX | 2358 | 580K| 27 (0)| 00:00:01 |
-----------------------------------------------------------------------------
both produced the same output, but plan is different. Will it make any performance difference on really big data ?

and Chris said...

There's only one real answer to this:

"It depends"

The execution plans are different, so yes it's possible that they will perform differently.

In the second query, Oracle knows that 1=2 returns no results. So in theory the overhead of this step will be minimal.

The two queries use different mechanisms for removing duplicates (hash unique vs sort unique). So it's possible that they will perform differently depending upon your data.

Stick with distinct. It's easier to understand.

Rating

(3 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Jess, February 21, 2019 - 10:54 am UTC

Hi Chris,

With the understanding of that it usually depends...

We have a query that's essentially about 50-80 individual selects all union'd to produce a giant list of output data. It runs in about 5-6 minutes (which is deemed acceptable, as only runs a few times a day on a schedule).

About half of the individual queries produce results that are implicitly already distinct. The other half produces duplicates (that are ultimately removed by the unions).

The queries with duplicates return anywhere between 6K and 15K rows each, and about 50% of results in each are dupes.

I can't quite work out which is nominally "better":
- let the queries select what they have, hold those results in memory (which is already holding everything else and a box of doughnuts) until all queries finish and then let the union produce the distinct set
or
- add distinct to some if not all of the queries with high duplicate percentage, so that deduplication is nominally done twice, but there's less result data being held while all the queries in the union are running to completion.

What is your view/advice?

February 21, 2019 - 4:08 pm UTC

Here's what a quick-and-dirty test shows:

I've built four tables. Two with no duplicates; two with 50% duplicates. There are no duplicates between tables. i.e. each value only exists in one table.

Then compared the following three options:

- UNION the tables
- UNION ALL the tables, applying DISTINCT to the tables with duplicates
- UNION the tables, applying DISTINCT to the tables with duplicates

Which shows:

create table no_dups_1 (
  c1 primary key
) as 
  select level from dual
  connect by level <= 10;

create table no_dups_2 (
  c1 
) as 
  select level+10 from dual
  connect by level <= 10;
  
create table half_dups_1 (
  c1 
) as 
  select mod ( level, 2 ) + 21
  from   dual
  connect by level <= 10;
  
create table half_dups_2 (
  c1 
) as 
  select mod ( level, 2 ) + 23
  from   dual
  connect by level <= 10;

set serveroutput off
alter session set statistics_level = all;

select * from no_dups_1
union  
select * from no_dups_2
union  
select * from half_dups_1
union  
select * from half_dups_2;

-----------------------------------------------------------------------   
| Id  | Operation           | Name         | Starts | E-Rows | A-Rows |   
-----------------------------------------------------------------------   
|   0 | SELECT STATEMENT    |              |      1 |        |     24 |   
|   1 |  SORT UNIQUE        |              |      1 |     40 |     24 |   
|   2 |   UNION-ALL         |              |      1 |        |     40 |   
|   3 |    INDEX FULL SCAN  | SYS_C0046833 |      1 |     10 |     10 |   
|   4 |    TABLE ACCESS FULL| NO_DUPS_2    |      1 |     10 |     10 |   
|   5 |    TABLE ACCESS FULL| HALF_DUPS_1  |      1 |     10 |     10 |   
|   6 |    TABLE ACCESS FULL| HALF_DUPS_2  |      1 |     10 |     10 |   
-----------------------------------------------------------------------

select * 
from   table(dbms_xplan.display_cursor(null, null, 'ROWSTATS LAST'));

select * from no_dups_1
union  all
select * from no_dups_2
union  all  
select distinct c1 from half_dups_1
union  all  
select distinct c1 from half_dups_2;

-----------------------------------------------------------------------    
| Id  | Operation           | Name         | Starts | E-Rows | A-Rows |    
-----------------------------------------------------------------------    
|   0 | SELECT STATEMENT    |              |      1 |        |     24 |    
|   1 |  UNION-ALL          |              |      1 |        |     24 |    
|   2 |   INDEX FULL SCAN   | SYS_C0046833 |      1 |     10 |     10 |    
|   3 |   TABLE ACCESS FULL | NO_DUPS_2    |      1 |     10 |     10 |    
|   4 |   HASH UNIQUE       |              |      1 |     10 |      2 |    
|   5 |    TABLE ACCESS FULL| HALF_DUPS_1  |      1 |     10 |     10 |    
|   6 |   HASH UNIQUE       |              |      1 |     10 |      2 |    
|   7 |    TABLE ACCESS FULL| HALF_DUPS_2  |      1 |     10 |     10 |    
----------------------------------------------------------------------- 

select * 
from   table(dbms_xplan.display_cursor(null, null, 'ROWSTATS LAST'));


select * from no_dups_1
union  
select * from no_dups_2
union    
select distinct c1 from half_dups_1
union    
select distinct c1 from half_dups_2;

select * 
from   table(dbms_xplan.display_cursor(null, null, 'ROWSTATS LAST'));

-----------------------------------------------------------------------   
| Id  | Operation           | Name         | Starts | E-Rows | A-Rows |   
-----------------------------------------------------------------------   
|   0 | SELECT STATEMENT    |              |      1 |        |     24 |   
|   1 |  SORT UNIQUE        |              |      1 |     40 |     24 |   
|   2 |   UNION-ALL         |              |      1 |        |     40 |   
|   3 |    INDEX FULL SCAN  | SYS_C0046833 |      1 |     10 |     10 |   
|   4 |    TABLE ACCESS FULL| NO_DUPS_2    |      1 |     10 |     10 |   
|   5 |    TABLE ACCESS FULL| HALF_DUPS_1  |      1 |     10 |     10 |   
|   6 |    TABLE ACCESS FULL| HALF_DUPS_2  |      1 |     10 |     10 |   
-----------------------------------------------------------------------

So:

UNION & UNION + DISTINCT give the same plan. In both cases the database pushes the * UNIQUE to the top. Instead of applying DISTINCT at the table level.

So unless you get creative (e.g. put the DISTINCT checks in a non-mergable subquery), these come out "the same". There's no duplication of the DISTINCT operation.

If you know your tables contain distinct values (so you can UNION ALL their distinct results), pushing the DISTINCT to the table level might come out better; I'd want to test on the real data set before deciding which method to use.

Jess, February 21, 2019 - 4:15 pm UTC

Hi Chris,

Thank you for the example. I would've thought that there would be a tipping point... Holding 40 or 100 or 1K rows in memory shouldn't make a difference to an enterprise-sized database. Here, for example, when all is said and done (looking at average number of queries/rows), there are about 1M rows waiting to be deduplicated... Now that might still be "small", but I would've thought there would be a point of diminishing returns somewhere, no?

Thank you!

February 21, 2019 - 4:38 pm UTC

Yes, at some point the distinct operation will have to spill to disk. Exactly when that happens depends on your available memory, data types, row sizes, etc.

If you're the only user of a database with 100Gb memory available, you'll be able to sort a lot more in RAM than one user sharing 10Gb with a thousand others.

So this is very much a "it depends" question. As long as you can avoid using temp, I wouldn't worry about it.

But when you do use temp you're reading & writing from disk... which will slow your query considerably.

You can look at v$sql_workarea_histogram to see when this is happening for a given query.

My point with the example is that:

select ... union select ...

gives the same plan as

select ... union select distinct ...

So it doesn't matter whether you add distinct to each table. You need to do something more drastic to force the distinct to the table level.

Jess, February 21, 2019 - 5:52 pm UTC

Thanks Chris, that helps!

(Btw, feel free to delete the last comment. I was trying to post it in another thread, but it ended up here somehow :( :( :( )

February 22, 2019 - 6:39 am UTC

Done

More to Explore

Multitenant

Need more information on Multitenant? Check out the Multitenant docs for the Oracle Database

Breadcrumb

Question and Answer

You Asked

and Chris said...

Rating

Comments

More to Explore

Multitenant

DISTINCT vs UNION

Breadcrumb

Question and Answer

You Asked

and Chris said...

Rating

Comments

More to Explore

Multitenant