Database, SQL and PL/SQL

On Joins and Query Plans

 

Our technologist optimizes joins and explains changing plans.

By Tom Kyte Oracle Employee ACE

May/June 2006

 

I know some of the rules about joins and that join 2 here is better than 1. Are there other rules about joins that I should know, particularly which are efficient to use and which are not?

1) SELECT distinct D.deptno, D.dname
    FROM dept D, emp E
        WHERE E.deptno = D.deptno
        ORDER BY D.deptno;
2) SELECT D.deptno, D.dname
    FROM dept D
        WHERE EXISTS
                 (
                 SELECT 1
                 FROM emp E
                 WHERE E.deptno = D.deptno
                 )
        ORDER BY D.deptno;

In general, you should phrase the queries in the manner that says it best. If one set of joins were particularly efficient to use in all cases, Oracle would not have implemented the rest of them!

In general, you use a join when you need data from more than one table in the ultimate SELECT list. Here you need data only from DEPT, so it is unlikely that you would consider a join. You would instead consider using either WHERE EXISTS or WHERE IN. (The cost-based optimizer [CBO] sees them as more or less equivalent—you might, too.)

I would write the query as either

select deptno, dname
  from dept
 where exists
   ( select NULL
       from emp
      where emp.deptno
         = dept.deptno )
 order by deptno;

or

select deptno, dname
  from dept
 where deptno in
( select deptno
    from emp )
 order by deptno;

In both cases, the optimizer would employ either a semi join, which you cannot specify but the optimizer may perform (it stops joining after the first hit), or an index probe using NESTED LOOPS into EMP to test for row existence.

This example demonstrates, in general, what happens. When the optimizer detects that DEPT is small and EMP is large, it index-probes the large EMP table for each row in DEPT. On the other hand, if it deems DEPT to be large, the optimizer will do a pure semi join in bulk, with no indexes. I'm using a copy of the EMP and DEPT tables for the test:

SQL> create table emp as
  2  select * from scott.emp;
Table created.
SQL> create table dept as
  2  select * from scott.dept;
Table created.
SQL> create index emp_deptno_idx
  2  on emp(deptno);
Index created.

Then I set the first representative statistics. I use SET_TABLE_STATS to make EMP appear large, with 1,000,000 rows, and DEPT appear small, with only 100 rows. Additionally, I tell the optimizer about the index I envision having on EMP(DEPTNO). Because Oracle Database 10g Release 2 computes statistics on an index creation by default, I start by removing any existing statistics on this index and putting in place statistics that might be more representative:

SQL> begin
  2   dbms_stats.set_table_stats
  3   ( user,
  4    'EMP',
  5    numrows => 1000000,
  6    numblks => 100000 );
  7   dbms_stats.delete_index_stats
  8   ( user,
  9    'EMP_DEPTNO_IDX' );
 10   dbms_stats.set_index_stats
 11   ( user,
 12    'EMP_DEPTNO_IDX',
 13     numrows => 1000000,
 14     numdist => 10000,
 15     numlblks =>10000 );
 16   dbms_stats.set_column_stats
 17   ( user,
 18    'EMP',
 19    'DEPTNO',
 20     DISTCNT => 10000 );
 21   dbms_stats.set_table_stats
 22   ( user,
 23    'DEPT',
 24     numrows=> 100,
 25     numblks => 100 );
 26  end;
 27  /

Now I'm ready to see what the optimizer decides to do. I use AUTOTRACE to review the query plans generated for two queries, either of which would be a correct way to ask this question. I'm showing two queries to demonstrate that when developers use the CBO, they don't have to decide on the best way to ask a question—the optimizer recognizes certain constructs and picks the best path possible. This is in contrast to the old rule-based optimizer (RBO), which would view WHERE IN and WHERE EXISTS very differently.

When using the CBO with a large EMP and a small DEPT, Oracle Database might use a plan such as the one in Listing 1. The optimizer opts to read every row in DEPT and then performs an index range scan—the index probe into the EMP table—to see if the row exists in the other table. Because I have relatively few index probes to perform (about 100 is what the optimizer thinks) and because the EMP table is large, the optimizer uses this plan in both the WHERE IN and WHERE EXISTS cases.

Code Listing 1: CBO plan with large EMP and small DEPT

SQL> set autotrace traceonly explain
SQL> select deptno, dname
  2    from dept where deptno  in
  3    ( select deptno
  4        from emp )
  5   order by deptno;
Execution Plan
-----------------------------------------------------------
Plan hash value: 3383088615
-----------------------------------------------------------
|  Id | Operation       | Name          |   Rows | Bytes  |
-----------------------------------------------------------
|   0 |SELECT STATEMENT |               |   100  |  3500  |
|   1 |SORT ORDER BY    |               |   100  |  3500  |
|   2 |NESTED LOOPS SEMI|               |   100  |  3500  |
|   3 |TABLE ACCESS FULL| DEPT          |   100  |  2200  |
|*  4 |INDEX RANGE SCAN | EMP_DEPTNO_IDX|  1000K |    12M |
-----------------------------------------------------------
Predicate Information (identified by operation id):
-----------------------------------------------------------
   4 - access("DEPTNO"="DEPTNO")
SQL> select deptno, dname
  2    from dept where exists
  3    ( select null
  4        from emp
  5       where emp.deptno =
  6            dept.deptno )
  7   order by deptno;
Execution Plan
------------------------------------------------------------
Plan hash value: 3383088615
------------------------------------------------------------
|  Id | Operation        | Name          |   Rows | Bytes  |
------------------------------------------------------------
|   0 | SELECT STATEMENT |               |   100  |  3500  |
|   1 | SORT ORDER BY    |               |   100  |  3500  |
|   2 | NESTED LOOPS SEMI|               |   100  |  3500  |
|   3 | TABLE ACCESS FULL| DEPT          |   100  |  2200  |
|*  4 | INDEX RANGE SCAN | EMP_DEPTNO_IDX|  1000K |    12M |
------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------------------------
   4 - access("EMP"."DEPTNO"="DEPT"."DEPTNO")

Compare that with the RBO plan, shown in Listing 2. When using the RBO, the developer must ask, "What size will EMP and DEPT be?" and, depending on the answer, choose either WHERE IN or WHERE EXISTS to get an "optimal" plan. The RBO comes up with two different plans based solely on how the question was asked, not on how large the amount of data was. This is one of the major advantages of the CBO over the RBO: Developers don't have to know how to write the best query in all cases—which is not to say that they don't need to learn SQL! They just do not have to figure out minutiae such as where they should use WHERE IN versus WHERE EXISTS.

Code Listing 2: RBO plan

SQL> select /*+ RULE */
  2    deptno, dname
  3    from dept where deptno  in
  4    ( select deptno
  5        from emp )
  6   order by deptno;
Execution Plan
-------------------------------------
Plan hash value: 1932208647
-------------------------------------
|  Id | Operation        | Name     |
-------------------------------------
|   0 | SELECT STATEMENT |          |
|   1 | MERGE JOIN       |          |
|   2 | SORT JOIN        |          |
|   3 | TABLE ACCESS FULL| DEPT     |
|*  4 | SORT JOIN        |          |
|   5 | VIEW             | VW_NSO_1 |
|   6 | SORT UNIQUE      |          |
|   7 | TABLE ACCESS FULL| EMP      |
-------------------------------------
Predicate Information (identified by operation id):
-------------------------------------
   4 - access("DEPTNO"="$nso_col_1")
       filter("DEPTNO"="$nso_col_1")
Note
----------
   - rule based optimizer used (consider using cbo)
SQL> select /*+ RULE */
  2    deptno, dname
  3    from dept where exists
  4    ( select null
  5        from emp
  6       where emp.deptno =
  7            dept.deptno )
  8   order by deptno;
Execution Plan
----------------------------------------
Plan hash value: 4109416194
----------------------------------------
| Id | Operation        | Name         |
----------------------------------------
|  0 | SELECT STATEMENT |              |
|  1 | SORT ORDER BY    |              |
|* 2 | FILTER           |              |
|  3 | TABLE ACCESS FULL| DEPT         |
|* 4 | INDEX RANGE SCAN | EMP_DEPTNO_ID|
----------------------------------------
Predicate Information (identified by operation id):
----------------------------------------
   2 - filter( EXISTS (SELECT 0 FROM "EMP" "EMP" WHERE
                "EMP"."DEPTNO"=:B1))
   4 - access("EMP"."DEPTNO"=:B1)

Now, back to the CBO. As the amount of data in DEPT increases—and the optimizer decides that it will have to perform significantly more index probes into the EMP table—the query plans change. They go from using NESTED LOOPS and INDEX RANGE SCANS to more-efficient bulk operations. To see this, I tell the optimizer that DEPT is much larger now

SQL> begin
  2     dbms_stats.set_table_stats
  3     ( user,
  4      'DEPT',
  5       numrows=> 100000,
  6       numblks => 10000 );
  7  end;
  8  /

and then rerun the queries and review the resulting plans, as shown in Listing 3. As you can see, the optimizer, in both cases, now chooses to perform full scans and a nice big hash semi join—a join that stops after the first match—instead of using an index probe, because it would have to use the index 100,000 times.

Code Listing 3: CBO plan with larger DEPT

SQL> set autotrace traceonly explain
SQL> select deptno, dname
  2    from dept where deptno  in
  3    ( select deptno
  4        from emp )
  5   order by deptno;
Execution Plan
-----------------------------------------------------------------
Plan hash value: 3127359958
-----------------------------------------------------------------
| Id |  Operation           | Name           |   Rows |  Bytes  |
-----------------------------------------------------------------
|  0 |  SELECT STATEMENT    |                |   100K |  3417K  |
|  1 |  SORT ORDER BY       |                |   100K |  3417K  |
|* 2 |  HASH JOIN SEMI      |                |   100K |  3417K  |
|  3 |  TABLE ACCESS FULL   | DEPT           |   100K |  2148K  |
|  4 |  INDEX FAST FULL SCAN| EMP_DEPTNO_IDX |   1000K|    12M  |
-----------------------------------------------------------------
Predicate Information (identified by operation id):
-----------------------------------------------------------------
   2 - access("DEPTNO"="DEPTNO")
SQL> select deptno, dname
  2    from dept where exists
  3    ( select null
  4        from emp
  5       where emp.deptno =
  6            dept.deptno )
  7   order by deptno;
Execution Plan
---------------------------------------------------------------
Plan hash value: 3127359958
---------------------------------------------------------------
| Id |  Operation           | Name          |   Rows |   Bytes|
---------------------------------------------------------------
|  0 |  SELECT STATEMENT    |               |   100K |  3417K |
|  1 |  SORT ORDER BY       |               |   100K |  3417K |
|* 2 |  HASH JOIN SEMI      |               |   100K |  3417K |
|  3 |  TABLE ACCESS FULL   | DEPT          |   100K |  2148K |
|  4 |  INDEX FAST FULL SCAN| EMP_DEPTNO_IDX|   1000K|    12M |
---------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------------------
   2 - access("EMP"."DEPTNO"="DEPT"."DEPTNO")

Another important thing to note about these examples is that the RBO never performed a semi join, as the CBO did. The RBO has a very limited set of access paths and join operations available—it simply doesn't know how to do a semi join and won't use that join operation. Instead it performs operations such as the SORT DISTINCT it did in Listing 2, which is much less efficient than the cost-based plan in this case. Another reason to abandon the RBO in favor of the CBO!

Why Does My Plan Change?

I have noticed that gathering statistics against tables where none of the data has changed can cause the query plans against those tables to change. For example, one day I gather statistics, run some queries, and save the query plans in a report. The next day, I gather statistics (but I have not modified any of the data in any table) and I discover that the plans are different. What could cause that? The new query plans were better, but I would like to understand how this could be.

This change is likely due to the default value of the METHOD_OPT parameter used with DBMS_STATS. In Oracle Database 10g, the METHOD_OPT parameter defaults to a SIZE AUTO. After you ran a query, the database remembered the predicates and updated a dictionary table, SYS.COL_USAGE$. Then, the next time you ran DBMS_STATS to gather statistics on these tables, DBMS_STATS queried that table to find out what columns should have histograms collected automatically, based on past query workload. It looked at your predicates and said, "Hmm, these columns are candidates for histograms based on the queries the end users have been running."

You can see this easily with a relatively small example. I start out by creating a small table with some skewed data in a column named ID and gathering the default set of statistics on it. The data in the ID column is such that for the values 0 through 4, about 20 percent of the table will be retrieved, but for the value 99, only a single row will be returned. I'm using this exaggerated example just to see the plans change:

SQL> create table t
  2  as
  3  select mod(rownum,5) id, a.*
  4    from all_objects a;
Table created.
SQL> update t
  2     set id = 99
  3   where rownum = 1;
1 row updated.
SQL> create index t_idx on t(id);
Index created.
SQL> begin
  2   dbms_stats.gather_table_stats
  3   ( user, 'T' );
  4  end;
  5  /
SQL> select column_name, count(*)
  2    from user_tab_histograms
  3   where table_name = 'T'
  4     and column_name = 'ID'
  5   group by column_name;
COLUMN_NAME   COUNT(*)
-----------   ------------
ID                  2

So, right now, the ID column doesn't have complete histograms, by default—just two buckets, as shown by the query against USER_TAB_HISTOGRAMS. To understand the skewed nature of the data, I need a histogram with more than just two buckets; the two entries in USER_TAB_HISTOGRAMS tell the optimizer only the high and low values right now.

The optimizer knows the high value (99), the low value (0), the number of distinct values (6 in this case), and the number of rows in my table T (50,119 when I tested). Given those facts, the optimizer will believe that WHERE ID = <value> will return about 50,119/6, or 8,353 rows. Sure enough, when I query with ID=1 or ID=99, I observe the results in Listing 4.

Code Listing 4: CBO plans when ID=1 and ID=99

SQL> set autotrace traceonly explain
SQL> select *
  2    from t
  3   where id = 1;
----------------------------------------------------------------------------
|  Id | Operation        | Name |  Rows |  Bytes|   Cost (%CPU) |  Time    |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT |      |  8353 |   783K|   163 (2)     | 00:00:02 |
|*  1 | TABLE ACCESS FULL| T    |  8353 |   783K|   163 (2)     | 00:00:02 |
----------------------------------------------------------------------------
SQL> select *
  2    from t
  3   where id = 99;
Execution Plan
----------------------------------------------------------------------------
Plan hash value: 1601196873
----------------------------------------------------------------------------
|  Id | Operation        | Name |   Rows |  Bytes|  Cost (%CPU) | Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT |      |   8353 |   783K|   163 (2)    | 00:00:02 |
|*  1 | TABLE ACCESS FULL| T    |   8353 |   783K|   163 (2)    | 00:00:02 |
----------------------------------------------------------------------------

Now, I immediately gather statistics, using the same command as before:

SQL> begin
  2   dbms_stats.gather_table_stats
  3   ( user, 'T' );
  4  end;
  5  /
SQL> select column_name, count(*)
  2    from user_tab_histograms
  3   where table_name = 'T'
  4     and column_name = 'ID'
  5   group by column_name;
COLUMN_NAME   COUNT(*)
------------  -----------
ID                  5

Note, however, that I have more than two buckets for my histograms. DBMS_STATS, using the AUTO setting, gathered more information here. If I were to query the SYS.COL_USAGE$ table right now, I would discover that a new row was added, indicating that I have some queries in my system that use equality predicates against this particular database column. That is the "magic" that caused DBMS_STATS to change how it gathered statistics and caused a dramatic change in my resulting query plans, as shown in Listing 5.

Code Listing 5: New CBO plans when ID=1 and ID=99

  2    from t
  3   where id = 1;
Execution Plan
----------------------------------------------------------------------------
Plan hash value: 1601196873
----------------------------------------------------------------------------
| Id |  Operation        | Name |  Rows |  Bytes|  Cost (%CPU)  | Time     |
----------------------------------------------------------------------------
|  0 |  SELECT STATEMENT |      |  10260|   961K|   164 (2)     | 00:00:02 |
|* 1 |  TABLE ACCESS FULL| T    |  10260|   961K|   164 (2)     | 00:00:02 |
----------------------------------------------------------------------------
SQL> select *
  2    from t
  3   where id = 99;
Execution Plan
----------------------------------------------------------------------------------------
Plan hash value: 470836197
----------------------------------------------------------------------------------------
| Id |  Operation                  | Name    | Rows |  Bytes|  Cost (%CPU)  |  Time    |
----------------------------------------------------------------------------------------
|  0 |  SELECT STATEMENT           |         | 1    |    96     |   2   (0) | 00:00:01 |
|  1 |  TABLE ACCESS BY INDEX ROWID|    T    | 1    |    96     |   2   (0) | 00:00:01 |
|* 2 |  INDEX RANGE SCAN           |    T_IDX| 1    |           |   1   (0) | 00:00:01 |
----------------------------------------------------------------------------------------

Note how the cardinalities are very different—8,353 changed to either 10,260 or 1. The resulting change in estimated cardinality led to a change in the overall cost of the first query plan and resulted in an entirely different (but better) query plan in the second case.

This is a significant fact for a DBA to be aware of. As the queries submitted by your end users change over time, the AUTO feature of statistics gathering could change what statistics are gathered over time as well. Unless you understand this feature, enabled by default in Oracle Database 10g, it can look as if inexplicable magic is happening in your database.

Next Steps

ASK Tom
Oracle Vice President Tom Kyte answers your most difficult technology questions. Highlights from that forum appear in this column.
 asktom.oracle.com

READ more Tom
Expert Oracle Database Architecture: 9i and 10g Programming Techniques and Solutions

LEARN more about the optimizer
 Oracle Database Performance Tuning Guide

 DOWNLOAD Oracle Database 10g Express Edition (Oracle Database XE)



 

DISCLAIMER: We've captured these popular historical magazine articles for your reference. Links etc contained within these article possibly won't work. These articles are provided as-is with no guarantee that the information within them is the best advice for current versions of the Oracle Database.