Hi Connor,
Thanks again for all the great work you and Chris are doing for us.
We sometimes struggle with the optimizer decisions. In the following case I narrowed the problem down to this subquery:
SELECT i.hw_oid32,
       hw_oid_lo,
       i.hw_version,
       i.hw_category,
       i.hw_score,
       i.hw_source,
       p.hw_user_or_group,
       p.hw_type,
       i.hw_iterator_index
FROM   hw_iterators i
       INNER JOIN hw_permissions p USING (hw_oid_lo)
       INNER JOIN hw_effective_groups g
          ON ( p.hw_user_or_group = g.hw_groupname
               AND p.hw_type = g.hw_type )
WHERE  i.hw_iterator_id = :p_iterator_id
AND    i.hw_iterator_id_internal = :v_int_id
AND    g.hw_session_id = :p_we_session_id
Running it with set autotrace traceonly showed the following execution plan:
Execution Plan
----------------------------------------------------------
Plan hash value: 570251651
--------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 422 | 37 (0)| 00:00:01 |
| 1 | NESTED LOOPS | | 1 | 422 | 37 (0)| 00:00:01 |
| 2 | MERGE JOIN CARTESIAN | | 1 | 401 | 36 (0)| 00:00:01 |
| 3 | TABLE ACCESS BY INDEX ROWID BATCHED | HW_EFFECTIVE_GROUPS | 1 | 284 | 34 (0)| 00:00:01 |
|* 4 | INDEX RANGE SCAN | HW_EGRP_IND1 | 1 | | 1 (0)| 00:00:01 |
| 5 | BUFFER SORT | | 1 | 117 | 2 (0)| 00:00:01 |
| 6 | TABLE ACCESS BY INDEX ROWID BATCHED| HW_ITERATORS | 1 | 117 | 2 (0)| 00:00:01 |
|* 7 | INDEX RANGE SCAN | HW_ITER_IND1 | 1 | | 1 (0)| 00:00:01 |
|* 8 | INDEX UNIQUE SCAN | HW_PER_PK | 1 | 21 | 1 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("G"."HW_SESSION_ID"=TO_NUMBER(:P_WE_SESSION_ID))
7 - access("I"."HW_ITERATOR_ID"=TO_NUMBER(:P_ITERATOR_ID) AND
"I"."HW_ITERATOR_ID_INTERNAL"=TO_NUMBER(:V_INT_ID))
8 - access("I"."HW_OID_LO"="P"."HW_OID_LO" AND "P"."HW_USER_OR_GROUP"="G"."HW_GROUPNAME" AND
"P"."HW_TYPE"="G"."HW_TYPE")
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
Statistics
----------------------------------------------------------
258 recursive calls
0 db block gets
1792277 consistent gets
550937 physical reads
160 redo size
247537 bytes sent via SQL*Net to client
5040 bytes received via SQL*Net from client
405 SQL*Net roundtrips to/from client
27 sorts (memory)
0 sorts (disk)
6050 rows processed
The problem is that the optimizer decides to start with the hw_effective_groups table and then joins it to the hw_iterators table, even though there is no join predicate between those two tables! If, on the other hand, I specify a leading(i) hint, the execution plan looks like this:
Execution Plan
----------------------------------------------------------
Plan hash value: 3948157368
-------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 422 | 39 (0)| 00:00:01 |
|* 1 | HASH JOIN | | 1 | 422 | 39 (0)| 00:00:01 |
| 2 | NESTED LOOPS | | 4 | 552 | 5 (0)| 00:00:01 |
| 3 | TABLE ACCESS BY INDEX ROWID BATCHED| HW_ITERATORS | 1 | 117 | 3 (0)| 00:00:01 |
|* 4 | INDEX RANGE SCAN | HW_ITER_IND1 | 1 | | 2 (0)| 00:00:01 |
|* 5 | INDEX RANGE SCAN | HW_PER_PK | 3 | 63 | 2 (0)| 00:00:01 |
| 6 | TABLE ACCESS BY INDEX ROWID BATCHED | HW_EFFECTIVE_GROUPS | 1 | 284 | 34 (0)| 00:00:01 |
|* 7 | INDEX RANGE SCAN | HW_EGRP_IND1 | 1 | | 1 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("P"."HW_USER_OR_GROUP"="G"."HW_GROUPNAME" AND "P"."HW_TYPE"="G"."HW_TYPE")
4 - access("I"."HW_ITERATOR_ID"=TO_NUMBER(:P_ITERATOR_ID) AND
"I"."HW_ITERATOR_ID_INTERNAL"=TO_NUMBER(:V_INT_ID))
5 - access("I"."HW_OID_LO"="P"."HW_OID_LO")
7 - access("G"."HW_SESSION_ID"=TO_NUMBER(:P_WE_SESSION_ID))
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
Statistics
----------------------------------------------------------
105 recursive calls
0 db block gets
6751 consistent gets
2101 physical reads
140 redo size
275473 bytes sent via SQL*Net to client
5040 bytes received via SQL*Net from client
405 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
6050 rows processed
Some information about the tables involved: hw_permissions is a rather big index-organized table (IOT) with about 9 million rows and current statistics. hw_iterators and hw_effective_groups, on the other hand, are normal (heap) tables, but since they are very volatile and their statistics would always be stale, I decided to delete and lock their stats. That way, I hoped, the optimizer would have to use dynamic sampling (which it apparently does). In my test case, hw_iterators held about 10,000 rows, a third of which matched :p_iterator_id/:v_int_id. hw_effective_groups held a little over 400 rows, 381 of which matched :p_we_session_id.
The test case ran on an otherwise idle system, and the behaviour seems to be reproducible. The number of rows in hw_iterators changes fairly quickly but will always be (a lot) smaller than the number of rows in hw_permissions. The same is true for hw_effective_groups, which might grow to a couple hundred thousand rows, but the subset matching this query's :p_we_session_id will usually be fewer than 1,000.
So my question is, why on earth does the optimizer choose a join of two tables with no common join predicates?
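For reference, the hinted variant I mentioned was simply the same query with a leading hint naming the hw_iterators alias (a sketch; the select list is abbreviated here):

```sql
SELECT /*+ leading(i) */
       i.hw_oid32,
       hw_oid_lo,
       -- ... remaining columns as in the original query ...
       i.hw_iterator_index
FROM   hw_iterators i
       INNER JOIN hw_permissions p USING (hw_oid_lo)
       INNER JOIN hw_effective_groups g
          ON ( p.hw_user_or_group = g.hw_groupname
               AND p.hw_type = g.hw_type )
WHERE  i.hw_iterator_id = :p_iterator_id
AND    i.hw_iterator_id_internal = :v_int_id
AND    g.hw_session_id = :p_we_session_id
```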
BTW, this is happening on a test database running:
Oracle Database 12c Standard Edition Release 12.2.0.1.0 - 64bit Production
The CREATE statements for the tables and indexes are on Live SQL, along with some statements to load the tables. I just can't run it on Live SQL itself because of quota limitations.
The optimizer thinks the query will return just 1 row from HW_EFFECTIVE_GROUPS (the rows column for lines 3 & 4 in the first plan).
As the docs say, it can choose a merge join Cartesian when:

"A Cartesian join is an efficient method. For example, the optimizer may decide to generate a Cartesian product of two very small tables that are both joined to the same large table."

https://docs.oracle.com/en/database/oracle/oracle-database/19/tgsql/joins.html#GUID-11E7100E-1316-4963-83C5-A85940BE9BB6

If the optimizer thinks it'll get (at most) one row from a table, this is fast: just combine that row with every row from the other table. But this only really works if you do get just one row from the first table!
Also bear in mind: you're working with explain plans. These are predictions about what might happen. To see what really happened, you want the execution plan:

https://blogs.oracle.com/sql/how-to-create-an-execution-plan

Getting this using your Live SQL examples, I get this plan:
--------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |
--------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 131 (100)| 3999 |
| 1 | NESTED LOOPS | | 1 | 2014K| 131 (87)| 3999 |
| 2 | MERGE JOIN CARTESIAN | | 1 | 2014K| 18 (0)| 2015K|
| 3 | TABLE ACCESS BY INDEX ROWID BATCHED | HW_ITERATORS | 1 | 3997 | 18 (0)| 3999 |
|* 4 | INDEX RANGE SCAN | HW_ITER_IND1 | 1 | 3997 | 18 (0)| 3999 |
| 5 | BUFFER SORT | | 3999 | 504 | 0 (0)| 2015K|
|* 6 | TABLE ACCESS BY INDEX ROWID BATCHED| HW_EFFECTIVE_GROUPS | 1 | 504 | 0 (0)| 504 |
|* 7 | INDEX RANGE SCAN | HW_EGRP_IND1 | 1 | 504 | 0 (0)| 504 |
|* 8 | INDEX UNIQUE SCAN | HW_PER_PK | 2015K| 1 | 0 (0)| 3999 |
--------------------------------------------------------------------------------------------------------------
The row estimate for HW_ITERATORS is now much higher (~4k) and pretty accurate. But it's still choosing a merge join Cartesian!
I'm not sure exactly why this is - this looks like a costing error to me. Notice that the cost for lines 5-8 is zero. I think it relates to the missing stats.
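For anyone following along, plans with Starts/E-Rows/A-Rows columns like the ones above come from the cursor cache rather than from EXPLAIN PLAN. One way to capture them in SQL*Plus (a sketch; run your own query in place of the placeholder):

```sql
-- Enable rowsource statistics for this statement via the hint,
-- then run the query you want to examine:
SELECT /*+ gather_plan_statistics */ COUNT(*)
FROM   hw_iterators i
WHERE  i.hw_iterator_id = :p_iterator_id;

-- Fetch the actual execution plan of the last statement
-- from the cursor cache, including actual row counts:
SELECT *
FROM   TABLE(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));
```

Alternatively, set statistics_level = all at the session level instead of using the hint.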
After gathering stats on all the tables I get this much more reasonable looking plan:
------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |
------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 8027 (100)| 3999 |
|* 1 | HASH JOIN | | 1 | 501K| 8027 (1)| 3999 |
|* 2 | TABLE ACCESS BY INDEX ROWID BATCHED| HW_EFFECTIVE_GROUPS | 1 | 504 | 5 (0)| 504 |
|* 3 | INDEX RANGE SCAN | HW_EGRP_IND1 | 1 | 504 | 2 (0)| 504 |
| 4 | NESTED LOOPS | | 1 | 533K| 8020 (1)| 535K|
|* 5 | TABLE ACCESS FULL | HW_ITERATORS | 1 | 3998 | 19 (0)| 3999 |
|* 6 | INDEX RANGE SCAN | HW_PER_PK | 3999 | 133 | 2 (0)| 535K|
------------------------------------------------------------------------------------------------------------
So rather than deleting the stats entirely then locking them, I suggest gathering stats on the volatile tables when they have (as best you can get) representative data in them.
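A minimal sketch of that approach, assuming you can run it at a moment when both tables hold representative data:

```sql
BEGIN
  -- Gather stats while the data is representative...
  dbms_stats.gather_table_stats(user, 'HW_ITERATORS');
  dbms_stats.gather_table_stats(user, 'HW_EFFECTIVE_GROUPS');
  -- ...then lock them so automatic or later manual gathers
  -- (perhaps run when the tables are nearly empty) can't overwrite them
  dbms_stats.lock_table_stats(user, 'HW_ITERATORS');
  dbms_stats.lock_table_stats(user, 'HW_EFFECTIVE_GROUPS');
END;
/
```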
Remember the goal isn't to have perfect stats, just stats good enough for the optimizer to choose a good plan.
You can still use dynamic sampling (now called dynamic statistics) by setting OPTIMIZER_DYNAMIC_SAMPLING at the session level or by using the dynamic_sampling hint.
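Both forms look like this (a sketch; levels range from 0 to 10, with 2 being the default):

```sql
-- Session-wide: raise the dynamic sampling level for all statements
ALTER SESSION SET optimizer_dynamic_sampling = 4;

-- Or per statement, per table, via the hint (here sampling hw_iterators):
SELECT /*+ dynamic_sampling(i 4) */ COUNT(*)
FROM   hw_iterators i
WHERE  i.hw_iterator_id = :p_iterator_id;
```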