Skip to Main Content
  • Questions
  • How to exclude rows from outer join calculation but still include them in results

Breadcrumb

Question and Answer

Chris Saxon

Thanks for the question, Mikhail.

Asked: April 14, 2021 - 11:43 am UTC

Last updated: April 20, 2021 - 4:21 pm UTC

Version: 12.1.0.2

Viewed 100+ times

You Asked

Hello Chris, Hello Connor.

I have a view with a schedule of some kind of launches:
create or replace view v_schedule as 
select trunc(sysdate) + interval'1'minute * level as plan_date 
from dual 
connect by level <= 100000


And a log of actual launch dates:
create table t_launches as
select trunc(sysdate) + interval'1'minute * level + mod(abs(dbms_random.random), 60) * interval'1'second as launch_date
from dual
connect by level <= 1000;


I want to join them like that:
create table t1 as
select plan_date, launch_date
from (
  select  plan_date,
          lead(plan_date) over (order by plan_date) as next_plan_date
  from v_schedule
)
  left join t_launches
    on (launch_date >= plan_date
    and launch_date < next_plan_date);


This works for about 10 seconds.

But I have an extra piece of information: launch_date cannot be larger than sysdate(trunc(sysdate) + interval'1'minute * 1000 in this example). So for 99% of rows in v_schedule there is no point in even trying to join them with anything in t_launches.

So a query like that works for about half a second:
create table t2 as
select plan_date, launch_date
from (
  select  plan_date,
          lead(plan_date) over (order by plan_date) as next_plan_date
  from v_schedule
  where plan_date <= trunc(sysdate) + interval'1'minute * 1000
)
  left join t_launches
    on (launch_date >= plan_date
    and launch_date < next_plan_date)
union all
select plan_date, null
from v_schedule
where plan_date > trunc(sysdate) + interval'1'minute * 1000;


My question is:
Is there any way to tell Oracle to filter rows from an outer join calculation but still include them in the results?

I thought this should work, but it doesn't:
create table t3 as
select plan_date, launch_date
from (
  select  plan_date,
          lead(plan_date) over (order by plan_date) as next_plan_date
  from v_schedule
)
  left join t_launches
    on (plan_date <= trunc(sysdate) + interval'1'minute * 1000
    and launch_date >= plan_date
    and launch_date < next_plan_date);


I would very much like to avoid union all because my actual query is much larger and complicated than this and union all would almost double its size. Besides, it does an extra pass over v_schedule.

with LiveSQL Test Case:

and we said...

It's up to the optimizer to decide how to join the tables. If you want to ensure it only considers a subset of rows, you need to do something clever like union the subqueries as in the second query.

But there's no need to do something like that here.

The plan for the first query is:

---------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name       | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |            |      1 |        |    100K|00:00:15.39 |     300K|
|   1 |  NESTED LOOPS OUTER              |            |      1 |      3 |    100K|00:00:15.39 |     300K|
|   2 |   VIEW                           |            |      1 |      1 |    100K|00:00:00.59 |       0 |
|   3 |    WINDOW SORT                   |            |      1 |      1 |    100K|00:00:00.52 |       0 |
|   4 |     VIEW                         | V_SCHEDULE |      1 |      1 |    100K|00:00:00.19 |       0 |
|   5 |      CONNECT BY WITHOUT FILTERING|            |      1 |        |    100K|00:00:00.06 |       0 |
|   6 |       FAST DUAL                  |            |      1 |      1 |      1 |00:00:00.01 |       0 |
|*  7 |   TABLE ACCESS FULL              | T_LAUNCHES |    100K|      3 |   1000 |00:00:14.72 |     300K|
---------------------------------------------------------------------------------------------------------


The optimizer thinks it's only fetching one row from V_SCHEDULE. This leads it to believe nested loops are a reasonable join method.

When in fact V_SCHEDULE returns 100,000 rows. Meaning it does 100,000 full scans of T_LAUNCHES. No wonder it's slow!

The solution here is straightforward: create an index.

This enables the database to do efficient lookups of T_LAUNCHES instead of scanning the whole table:

create index launch_date_i
 on t_launches ( launch_date );

select plan_date, launch_date
from (
  select  plan_date,
          lead(plan_date) over (order by plan_date) as next_plan_date
  from v_schedule
)
  left join t_launches
  on (launch_date >= plan_date
  and launch_date < next_plan_date);

---------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |               |      1 |        |    100K|00:00:00.71 |     123 |      4 |
|   1 |  NESTED LOOPS OUTER              |               |      1 |      3 |    100K|00:00:00.71 |     123 |      4 |
|   2 |   VIEW                           |               |      1 |      1 |    100K|00:00:00.58 |       0 |      0 |
|   3 |    WINDOW SORT                   |               |      1 |      1 |    100K|00:00:00.54 |       0 |      0 |
|   4 |     VIEW                         | V_SCHEDULE    |      1 |      1 |    100K|00:00:00.32 |       0 |      0 |
|   5 |      CONNECT BY WITHOUT FILTERING|               |      1 |        |    100K|00:00:00.08 |       0 |      0 |
|   6 |       FAST DUAL                  |               |      1 |      1 |      1 |00:00:00.01 |       0 |      0 |
|*  7 |   INDEX RANGE SCAN               | LAUNCH_DATE_I |    100K|      3 |   1000 |00:00:00.09 |     123 |      4 |
---------------------------------------------------------------------------------------------------------------------


This drops the query down from ~15s to <1s for me. The estimates for V_SCHEDULE are still way off though. To improve these, turn this into a real table:

create table t_schedule as
  select plan_date
  from   v_schedule;

select plan_date, launch_date
from (
  select  plan_date,
          lead(plan_date) over (order by plan_date) as next_plan_date
  from t_schedule
)
  left join t_launches
  on (launch_date >= plan_date
  and launch_date < next_plan_date);

------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |               |      1 |        |    100K|00:00:00.65 |     190 |    180 |
|   1 |  MERGE JOIN OUTER       |               |      1 |    100K|    100K|00:00:00.65 |     190 |    180 |
|   2 |   VIEW                  |               |      1 |    100K|    100K|00:00:00.26 |     183 |    180 |
|   3 |    WINDOW SORT          |               |      1 |    100K|    100K|00:00:00.22 |     183 |    180 |
|   4 |     TABLE ACCESS FULL   | T_SCHEDULE    |      1 |    100K|    100K|00:00:00.07 |     183 |    180 |
|*  5 |   FILTER                |               |    100K|        |   1000 |00:00:00.32 |       7 |      0 |
|*  6 |    SORT JOIN            |               |    100K|   1000 |    500K|00:00:00.18 |       7 |      0 |
|   7 |     INDEX FAST FULL SCAN| LAUNCH_DATE_I |      1 |   1000 |   1000 |00:00:00.01 |       7 |      0 |
------------------------------------------------------------------------------------------------------------


The row estimates for T_SCHEDULE are now bang on. For this specific query it doesn't really matter, but as you mention this is part of a larger query the better estimates may give a better plan overall.

Rating

  (2 ratings)

Comments

Wouldn't that be a good optimisation trick?

Mikhail Onishchenko, April 16, 2021 - 11:37 am UTC

Hey Chris, thank you for the answer!

Even if it works good enough with the index, 99% of the work oracle does in this join is useless.

Wouldn't it be a good optimization for join execution if oracle considered the predicates in the 'on (...)' clause that only depend on values in one table and filter rows based on that before doing actual join operations?

I'm not too familliar with how joins work at a very deep level, so I might be missing something obvious here. But if I was to write some kind of pseudocode for that join, I could make it more efficient than any of the queries I provided in the original post.

Or maybe 'on (...)' clauses with predicates that only use values from one of the tables are used so rarely that this optimization is just not worth the effort?
Chris Saxon
April 19, 2021 - 12:41 pm UTC

That's exactly what creating the index does! It enables the database to only access the rows that match the join criteria and ignore the rest.

How the database processes this depends on the join type. To understand these, watch this video I have explaining them:



This is taken from my free performance tuning class, Databases for Developers: Performance. If you want to learn more about SQL tuning, you can register for this at:

https://devgym.oracle.com/pls/apex/dg/class/databases-for-developers-performance.html

We seem to be talking about different things.

Mikhail Onishchenko, April 20, 2021 - 9:42 am UTC

We seem to be talking about different things.

I think you are talking about ignoring the rows from t_launches. Which I agree the index does there and greatly improves the performance.

I was talking about ignoring the rows from v_schedule (or t_schedule) in regards to my 3rd query.

Even if the index allows to ignore rows from t_launches the nested loop still has to look in the index for every row in v_schedule. I.e. 100 000 times.

But I know that this condition is always true: plan_date <= launch_date <= sysdate. Simply because launches cannot happen earlier than their planned time and they also cannot be in the future.

Ignoring the rows from v_schedule where plan_date <= sysdate (trunc(sysdate) + interval'1'minute * 1000 for the provided example) brings it down to 1000 rows. Reducing the work by a factor of 100.

As for the plan with merge join, funnily enough making v_schedule into a table and proper stats have ruined the plan. It works just as slow as the query did initially. I had to add use_nl and index_rs_asc to bring it back under a second.
Chris Saxon
April 20, 2021 - 4:21 pm UTC

So what you're wanting is:

The database to stop processing v_schedule when launch_date > sysdate?

The problem here is the database doesn't know there are no rows > sysdate in t_launches. It has to check t_launches to verify this. So it can't just stop processing the join for plan_dates after this.

As far as I'm aware, the optimizer won't currently use filters in the join clause to avoid the join as you want. It may be possible to add this, you'll need to add an enhancement request for this though.

So get what you want, you need to split the queries and union them as in the T2 query.

In practice it makes little difference to the example queries. While 100,000 index lookups vs. 1,000 sounds bad the runtime is similar. Most of the time is spent fetching the rows from the schedule:

select plan_date, launch_date
from (
  select  plan_date,
          lead(plan_date) over (order by plan_date) as next_plan_date
  from v_schedule
)
left join t_launches
on   plan_date <= to_date ( '2021-04-20 13:00', 'yyyy-mm-dd hh24:mi' )
and  launch_date >= plan_date
and  launch_date < next_plan_date;

------------------------------------------------------------------------------------------------------------        
| Id  | Operation                        | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers |        
------------------------------------------------------------------------------------------------------------        
|   0 | SELECT STATEMENT                 |               |      1 |        |    100K|00:00:00.64 |    1030 |        
|   1 |  NESTED LOOPS OUTER              |               |      1 |      1 |    100K|00:00:00.64 |    1030 |        
|   2 |   VIEW                           |               |      1 |      1 |    100K|00:00:00.48 |       0 |        
|   3 |    WINDOW SORT                   |               |      1 |      1 |    100K|00:00:00.43 |       0 |        
|   4 |     VIEW                         | V_SCHEDULE    |      1 |      1 |    100K|00:00:00.19 |       0 |        
|   5 |      CONNECT BY WITHOUT FILTERING|               |      1 |        |    100K|00:00:00.06 |       0 |        
|   6 |       FAST DUAL                  |               |      1 |      1 |      1 |00:00:00.01 |       0 |        
|*  7 |   INDEX RANGE SCAN               | LAUNCH_DATE_I |    100K|      1 |    780 |00:00:00.11 |    1030 |        
------------------------------------------------------------------------------------------------------------

select plan_date, launch_date
from (
  select  plan_date,
          lead(plan_date) over (order by plan_date) as next_plan_date
  from v_schedule
  where plan_date <= to_date ( '2021-04-20 13:00', 'yyyy-mm-dd hh24:mi' )
)
left join t_launches
on   launch_date >= plan_date
and  launch_date < next_plan_date
union all
select plan_date, null
from   v_schedule
where  plan_date > to_date ( '2021-04-20 13:00', 'yyyy-mm-dd hh24:mi' );

-------------------------------------------------------------------------------------------------------------    
| Id  | Operation                         | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers |    
-------------------------------------------------------------------------------------------------------------    
|   0 | SELECT STATEMENT                  |               |      1 |        |    100K|00:00:00.58 |      38 |    
|   1 |  UNION-ALL                        |               |      1 |        |    100K|00:00:00.58 |      38 |    
|   2 |   NESTED LOOPS OUTER              |               |      1 |      3 |    780 |00:00:00.19 |      38 |    
|   3 |    VIEW                           |               |      1 |      1 |    780 |00:00:00.19 |       0 |    
|   4 |     WINDOW SORT                   |               |      1 |      1 |    780 |00:00:00.19 |       0 |    
|*  5 |      VIEW                         | V_SCHEDULE    |      1 |      1 |    780 |00:00:00.19 |       0 |    
|   6 |       CONNECT BY WITHOUT FILTERING|               |      1 |        |    100K|00:00:00.06 |       0 |    
|   7 |        FAST DUAL                  |               |      1 |      1 |      1 |00:00:00.01 |       0 |    
|*  8 |    INDEX RANGE SCAN               | LAUNCH_DATE_I |    780 |      3 |    779 |00:00:00.01 |      38 |    
|*  9 |   VIEW                            | V_SCHEDULE    |      1 |      1 |  99220 |00:00:00.31 |       0 |    
|  10 |    CONNECT BY WITHOUT FILTERING   |               |      1 |        |    100K|00:00:00.10 |       0 |    
|  11 |     FAST DUAL                     |               |      1 |      1 |      1 |00:00:00.01 |       0 |    
-------------------------------------------------------------------------------------------------------------


(I swapped sysdate for 20 Apr to make the queries consistent)

Just joining takes 640ms vs 580ms for splitting the join on date. Though the union all approach does notably less IO (Buffers - 38 vs 1,030).

This can be greatly reduced by using a different join method, but there are several challenges with this query stopping this:

- The optimizer thinks v_schedule only returns one row
- You're outer joining v_schedule, so the database needs to return all the rows it produces
- There are no equality conditions (=) in the join

The optimizer prefers to start joins with the table returning fewer rows. So thinking that the larger table only returns one row stops this! This also leads it to think nested loops are best, as it believes there's only one indexed lookup to do on t_launches. This is clearly wrong :)

Outer joining also limits the join options. It can't use t_launches as the outer table in nested loops joins for example.

As all the join criteria are range conditions, the optimizer is unable to use a hash join. This would allow it to make t_launches the driving table and only scan both data sets once.

Adding an equality to the join (trunc ( launch_date ) = trunc ( plan_date ) - is this safe?) means it can hash join. I've also cheated and used the cardinlaity hint to tell the optimizer that v_schedule has 100,000 rows.

Avoid the cardinality hint in production code!

This flips the join order and allows the database to use a hash join:

select plan_date, launch_date
from (
  select  /*+ cardinality ( 100000 ) */plan_date,
          lead(plan_date) over (order by plan_date) as next_plan_date
  from v_schedule
) v
left join t_launches t
on   launch_date >= plan_date
and  launch_date < next_plan_date
and  trunc ( launch_date ) = trunc ( plan_date );

------------------------------------------------------------------------------------------------------------    
| Id  | Operation                        | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers |    
------------------------------------------------------------------------------------------------------------    
|   0 | SELECT STATEMENT                 |               |      1 |        |    100K|00:00:00.74 |       7 |    
|*  1 |  HASH JOIN RIGHT OUTER           |               |      1 |    100K|    100K|00:00:00.74 |       7 |    
|   2 |   INDEX FAST FULL SCAN           | LAUNCH_DATE_I |      1 |   1000 |   1000 |00:00:00.01 |       7 |    
|   3 |   VIEW                           |               |      1 |    100K|    100K|00:00:00.45 |       0 |    
|   4 |    WINDOW SORT                   |               |      1 |    100K|    100K|00:00:00.41 |       0 |    
|   5 |     VIEW                         | V_SCHEDULE    |      1 |      1 |    100K|00:00:00.19 |       0 |    
|   6 |      CONNECT BY WITHOUT FILTERING|               |      1 |        |    100K|00:00:00.06 |       0 |    
|   7 |       FAST DUAL                  |               |      1 |      1 |      1 |00:00:00.01 |       0 |    
------------------------------------------------------------------------------------------------------------


Much less work, but slower than either of the original joins!

As for the plan with merge join, funnily enough making v_schedule into a table and proper stats have ruined the plan

Can you share exactly what you did?

More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.