Hi Tom – long time lurker, first time submitter.
Running 11.2.0.3 on HP-UX ia64 B.11.31
This question may seem a bit off the wall on the surface – in fact I’m struggling to find the find the right way to ask it, but here goes anyways…..
I’m wondering if there is any way, or if Oracle has ever considered adding the ability to make the optimizer aware of sets of logical conditions in data which always lead to other conditions. That is to say that if conditions A, B, and C are true, then condition D must always be true as well. To take that a step further, have the optimizer dynamically add a predicate or predicates to queries based on those conditions which would improve their performance.
Here’s the reason I ask:
I’m a DBA working in a data warehouse environment. The data we work with is primarily related to historical claim transactions. Our largest table (by far) contains claim data going back to 1977. That table is partitioned on a column named “paid_date" which represents the date the claim was finalized in the source system (regardless of whether the claim was paid or denied). Several times over the years, we’ve studied whether or not the paid_date column is the right column to use for partitioning, and the answer has always been “yes”.
Most often when a user develops a query, they are logically interested in only recently paid claims, say within the last two years or so. When users include a predicate on the paid_date column, and they quite often do, then all is right in our data warehouse world. However, if they neglect to include a predicate on the paid_date column when the predicate would not logically affect the results, then their queries frequently result in execution plans which use full table scans of all of the partitions rather than employing partition pruning. Because of other predicates, the rows returned will all come from those more recent partitions (that is, logically, they get the same results), but the database has to do a lot more work to get to those same results. The difference in the number of rows for the last two years and the number of rows for 35 years is significant.
As a human being familiar with the characteristics of the data in the table, I know the following to be true:
If
1) the value of the “detail_status_code” column is ‘P’ (for “paid”) and
2) the value of the “active_ind” column is ‘Y’
Then
I know for a fact that the value of the “paid_date” column must be equal to or greater than the value of the “first_date_of_service”.
I know this to be true because in order for a claim to be paid, the dates of service on the claim cannot be in the future (the source system performs edits which will deny claims submitted with dates of service in the future), and because a query designed to test that hypothesis shows it to be true! I also know that many queries submitted against that table include predicates on those three key columns (detail_status_code, active_ind, and first_date_of_service).
No matter how often we train and exhort the user community to include a predicate on the paid_date column whenever the logic of their query will allow it, they still frequently omit that precious predicate. This is somewhat understandable since when constructing queries users tend to be more focused on defining predicates which answer the question they are trying to ask than they are on query performance. Therefore, time and again we see queries running which could take advantage of partitioning had the user included a predicate on the paid_date column where that predicate would not impact the logic of the query.
This will be obvious to you, but here’s an example of the difference a predicate on the partition key column makes:
SQL> explain plan for
select
count(*)
from
claim_analysis
where
active_ind = 'Y' and
detail_status_code = 'P' and
first_date_of_service between to_date('20120101','YYYYMMDD') and to_date('20121231','YYYYMMDD');
Explained.
SQL> @$ORACLE_HOME/rdbms/admin/utlxpls
PLAN_TABLE_OUTPUT
Plan hash value: 2313963445
---------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 12 | 1760K (1)| 02:00:25 | | |
| 1 | SORT AGGREGATE | | 1 | 12 | | | | |
| 2 | PX COORDINATOR | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | 12 | | | | |
| 4 | SORT AGGREGATE | | 1 | 12 | | | | |
| 5 | PX BLOCK ITERATOR | | 34M| 397M| 1760K (1)| 02:00:25 | 1 | 42 |
|* 6 | TABLE ACCESS FULL| CLAIM_ANALYSIS | 34M| 397M| 1760K (1)| 02:00:25 | 1 | 42 |
---------------------------------------------------------------------------------------------------------
The results and elapsed time for the execution of this query are:
COUNT(*)
----------
40312681
Elapsed: 00:14:16.12
Now the plan for the same query after adding the paid_date predicate:
SQL> explain plan for
select
count(*)
from
claim_analysis
where
active_ind = 'Y' and
detail_status_code = 'P' and
first_date_of_service between to_date('20120101','YYYYMMDD') and to_date('20121231','YYYYMMDD') and
paid_date >= to_date('20120101','YYYYMMDD');
Explained.
SQL> @$ORACLE_HOME/rdbms/admin/utlxpls
PLAN_TABLE_OUTPUT
Plan hash value: 876324039
---------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 20 | 163K (1)| 00:11:11 | | |
| 1 | SORT AGGREGATE | | 1 | 20 | | | | |
| 2 | PX COORDINATOR | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | 20 | | | | |
| 4 | SORT AGGREGATE | | 1 | 20 | | | | |
| 5 | PX BLOCK ITERATOR | | 42M| 802M| 163K (1)| 00:11:11 | 30 | 42 |
|* 6 | TABLE ACCESS FULL| CLAIM_ANALYSIS | 42M| 802M| 163K (1)| 00:11:11 | 30 | 42 |
---------------------------------------------------------------------------------------------------------
The results and elapsed time for this query are:
COUNT(*)
----------
40312681
Elapsed: 00:04:27.58
Of course, the most telling difference between the plans lies in the number of partitions required. The first query must look through all 42 partitions. The second query can ignore the first 29 partitions. (Note that for older data there is one partition per year, and for the most recent four years there are six partitions per year). And, of course, the execution time is significantly improved when the additional predicate is included (14:16 versus 04:27), yet the results of the query are identical.
So, to reiterate my question:
“Is there a way to, or has Oracle ever considered delivering a method that would allow humans to impart knowledge about data to the optimizer that the optimizer doesn’t derive from statistics?”
I guess I’m visualizing something that would be similar to but expand upon the way that VPD operates – enforcing a rule that has been defined in the database by applying an additional predicate or predicates to queries that have certain characteristics. Although in this case the rule would be applied to enhance performance rather than to enforce security.
In the case of the simple example query that I’ve provided, the characteristics of interest would be:
1). The query selects from a specific object (the table named “claim_analysis” in a specific schema)
2). The predicates supplied in the submitted query include:
-- detail_status_code = ‘P’
-- active_ind = ‘Y’
-- a filter based on the first_date_of_service column
3). The predicates do not include a predicate on the paid_date column
And the predicate to be “tacked on” to the submitted query would be:
and paid_date >= the earliest value for the first_date_of_service that was included in the submitted query
I hope this question isn’t too bizarre for this forum, and there may very well be a way that I’m not aware of to get these same results without additional database capabilities. And, of course, I know that the simple answer is to force behavioral changes from our users, but human behavior is fickle. I think you can see how such a capability could add value.
Thanks for your time, and let me say that I really appreciate your work on askTom – you provide extremely valuable information, and your patience is beyond virtuous!