Skip to Main Content
  • Questions
  • Impart human knowledge to the optimizer?

Breadcrumb

May 4th

Question and Answer

Tom Kyte

Thanks for the question, Dan.

Asked: February 26, 2013 - 8:07 am UTC

Last updated: March 05, 2013 - 11:49 am UTC

Version: 11.2.0.3

Viewed 1000+ times

You Asked

Hi Tom – long time lurker, first time submitter.

Running 11.2.0.3 on HP-UX ia64 B.11.31

This question may seem a bit off the wall on the surface – in fact I’m struggling to find the find the right way to ask it, but here goes anyways…..

I’m wondering if there is any way, or if Oracle has ever considered adding the ability to make the optimizer aware of sets of logical conditions in data which always lead to other conditions. That is to say that if conditions A, B, and C are true, then condition D must always be true as well. To take that a step further, have the optimizer dynamically add a predicate or predicates to queries based on those conditions which would improve their performance.

Here’s the reason I ask:

I’m a DBA working in a data warehouse environment. The data we work with is primarily related to historical claim transactions. Our largest table (by far) contains claim data going back to 1977. That table is partitioned on a column named “paid_date" which represents the date the claim was finalized in the source system (regardless of whether the claim was paid or denied). Several times over the years, we’ve studied whether or not the paid_date column is the right column to use for partitioning, and the answer has always been “yes”.

Most often when a user develops a query, they are logically interested in only recently paid claims, say within the last two years or so. When users include a predicate on the paid_date column, and they quite often do, then all is right in our data warehouse world. However, if they neglect to include a predicate on the paid_date column when the predicate would not logically affect the results, then their queries frequently result in execution plans which use full table scans of all of the partitions rather than employing partition pruning. Because of other predicates, the rows returned will all come from those more recent partitions (that is, logically, they get the same results), but the database has to do a lot more work to get to those same results. The difference in the number of rows for the last two years and the number of rows for 35 years is significant.

As a human being familiar with the characteristics of the data in the table, I know the following to be true:

If
1) the value of the “detail_status_code” column is ‘P’ (for “paid”) and
2) the value of the “active_ind” column is ‘Y’
Then
I know for a fact that the value of the “paid_date” column must be equal to or greater than the value of the “first_date_of_service”.

I know this to be true because in order for a claim to be paid, the dates of service on the claim cannot be in the future (the source system performs edits which will deny claims submitted with dates of service in the future), and because a query designed to test that hypothesis shows it to be true! I also know that many queries submitted against that table include predicates on those three key columns (detail_status_code, active_ind, and first_date_of_service).

No matter how often we train and exhort the user community to include a predicate on the paid_date column whenever the logic of their query will allow it, they still frequently omit that precious predicate. This is somewhat understandable since when constructing queries users tend to be more focused on defining predicates which answer the question they are trying to ask than they are on query performance. Therefore, time and again we see queries running which could take advantage of partitioning had the user included a predicate on the paid_date column where that predicate would not impact the logic of the query.

This will be obvious to you, but here’s an example of the difference a predicate on the partition key column makes:

 SQL> explain plan for
select   
   count(*)
from
   claim_analysis
where
   active_ind = 'Y' and
   detail_status_code = 'P' and
   first_date_of_service between  to_date('20120101','YYYYMMDD') and to_date('20121231','YYYYMMDD');

Explained.

SQL> @$ORACLE_HOME/rdbms/admin/utlxpls

PLAN_TABLE_OUTPUT
Plan hash value: 2313963445

---------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name           | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                |     1 |    12 |  1760K  (1)| 02:00:25 |       |       |
|   1 |  SORT AGGREGATE        |                |     1 |    12 |            |          |       |       |
|   2 |   PX COORDINATOR       |                |       |       |            |          |       |       |
|   3 |    PX SEND QC (RANDOM) | :TQ10000       |     1 |    12 |            |          |       |       |
|   4 |     SORT AGGREGATE     |                |     1 |    12 |            |          |       |       |
|   5 |      PX BLOCK ITERATOR |                |    34M|   397M|  1760K  (1)| 02:00:25 |     1 |    42 |
|*  6 |       TABLE ACCESS FULL| CLAIM_ANALYSIS |    34M|   397M|  1760K  (1)| 02:00:25 |     1 |    42 |
---------------------------------------------------------------------------------------------------------


The results and elapsed time for the execution of this query are:

COUNT(*)
----------
40312681

Elapsed: 00:14:16.12

Now the plan for the same query after adding the paid_date predicate:

SQL> explain plan for
select   
   count(*)
from
   claim_analysis
where
    active_ind = 'Y' and
    detail_status_code = 'P' and
    first_date_of_service between  to_date('20120101','YYYYMMDD') and to_date('20121231','YYYYMMDD') and
    paid_date >= to_date('20120101','YYYYMMDD');

Explained.

SQL> @$ORACLE_HOME/rdbms/admin/utlxpls

PLAN_TABLE_OUTPUT

Plan hash value: 876324039

---------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name           | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                |     1 |    20 |   163K  (1)| 00:11:11 |       |       |
|   1 |  SORT AGGREGATE        |                |     1 |    20 |            |          |       |       |
|   2 |   PX COORDINATOR       |                |       |       |            |          |       |       |
|   3 |    PX SEND QC (RANDOM) | :TQ10000       |     1 |    20 |            |          |       |       |
|   4 |     SORT AGGREGATE     |                |     1 |    20 |            |          |       |       |
|   5 |      PX BLOCK ITERATOR |                |    42M|   802M|   163K  (1)| 00:11:11 |    30 |    42 |
|*  6 |       TABLE ACCESS FULL| CLAIM_ANALYSIS |    42M|   802M|   163K  (1)| 00:11:11 |    30 |    42 |
---------------------------------------------------------------------------------------------------------


The results and elapsed time for this query are:

COUNT(*)
----------
40312681

Elapsed: 00:04:27.58

Of course, the most telling difference between the plans lies in the number of partitions required. The first query must look through all 42 partitions. The second query can ignore the first 29 partitions. (Note that for older data there is one partition per year, and for the most recent four years there are six partitions per year). And, of course, the execution time is significantly improved when the additional predicate is included (14:16 versus 04:27), yet the results of the query are identical.

So, to reiterate my question:
“Is there a way to, or has Oracle ever considered delivering a method that would allow humans to impart knowledge about data to the optimizer that the optimizer doesn’t derive from statistics?”

I guess I’m visualizing something that would be similar to but expand upon the way that VPD operates – enforcing a rule that has been defined in the database by applying an additional predicate or predicates to queries that have certain characteristics. Although in this case the rule would be applied to enhance performance rather than to enforce security.

In the case of the simple example query that I’ve provided, the characteristics of interest would be:

1). The query selects from a specific object (the table named “claim_analysis” in a specific schema)
2). The predicates supplied in the submitted query include:
-- detail_status_code = ‘P’
-- active_ind = ‘Y’
-- a filter based on the first_date_of_service column
3). The predicates do not include a predicate on the paid_date column

And the predicate to be “tacked on” to the submitted query would be:

and paid_date >= the earliest value for the first_date_of_service that was included in the submitted query

I hope this question isn’t too bizarre for this forum, and there may very well be a way that I’m not aware of to get these same results without additional database capabilities. And, of course, I know that the simple answer is to force behavioral changes from our users, but human behavior is fickle. I think you can see how such a capability could add value.

Thanks for your time, and let me say that I really appreciate your work on askTom – you provide extremely valuable information, and your patience is beyond virtuous!

and Tom said...

“Is there a way to, or has Oracle ever considered delivering a method that would allow humans to impart knowledge about data to the optimizer that the optimizer doesn’t derive from statistics?”

yes, there is. Constraints - primary key, foreign key, check and not null constraints are used to rewrite queries all of the time. CREATE DIMENSION describes how to use fact tables to rewrite queries (implicit relationships between attributes in a time dimension for example, the fact that DAY implies MONTH_YEAR implies YEAR and also that DAY implies FISCAL_YEAR_MONTH_YEAR and that implies FISCAL_YEAR).

Here is an example:
http://asktom.oracle.com/Misc/stuck-in-rut.html

but that said, there is currently no way to assert your fact into the database. A check constraint would be currently too complex to rewrite with (lots of Or's would be involved in it)


so the answer to your question is yes, but, the exact solution to your problem does not yet exist. Things are always changing - so there may be some facility for this in future releases - but not currently.

Rating

  (5 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Some time in the future...

Toon Koppelaars, February 26, 2013 - 10:01 am UTC


Your condition:

(detail_status_code = 'P' and active_ind = 'Y') implies (paid_date >= first_date_of_service)

Can declaratively be implemented with following CHECK constraint:

alter table claims_analytics add check
(detail_status_code != 'P' or active_ind != 'Y' or paid_date >= first_date_of_service)
/


A smart cost based optimizer could use such a constraint by injecting its text into the where-clause of a query.
So for instance a query such as:

select *
from claims_analytics
where detail_status_code = 'P' and active_ind = 'Y' and first_date_of_service > sysdate - 60


Is semantically equivalent to the following query, which has the constraint predicate injected into it:

select *
from claims_analytics
where detail_status_code = 'P' and active_ind = 'Y' and first_date_of_service > sysdate - 60
-- Injected by optimizer
  and (detail_status_code != 'P' or active_ind != 'Y' or paid_date >= first_date_of_service)


This WHERE clause can then be rewritten following a couple of standard rewrite rules of logic.

1) Rewrite it using conjunction(and)-distribution over disjunction(or) into:

where (detail_status_code = 'P' and active_ind = 'Y' and first_date_of_service > sysdate - 60 and detail_status_code != 'P')
   or
      (detail_status_code = 'P' and active_ind = 'Y' and first_date_of_service > sysdate - 60 and active_ind != 'Y')
   or
      (detail_status_code = 'P' and active_ind = 'Y' and first_date_of_service > sysdate - 60 and paid_date >= first_date_of_service)


This then is equivalent to (using, ['P' != 'P'] is false, ['Y' != 'Y'] is false, [anything and false] is false):

where (false)
   or
      (false)
   or
      (detail_status_code = 'P' and active_ind = 'Y' and first_date_of_service > sysdate - 60 and paid_date >= first_date_of_service)


Which is equivalent to (using, [false or anything] is anything):

where detail_status_code = 'P' and active_ind = 'Y' and first_date_of_service > sysdate - 60 and paid_date >= first_date_of_service


Which is equivalent to (transitive closure over unbounded range predicates):

where detail_status_code = 'P'
  and active_ind = 'Y'
  and first_date_of_service > sysdate - 60
  and paid_date >= first_date_of_service
  and paid_date > sysdate - 60


Which then enables the optimizer to apply partition pruning.

I have presented on this topic of "Semantic Query Optimization" at Hotsos 2008.

There are quite a few use-cases like yours here, that the CBO cannot deal with yet, but theoretically should in principle be able to deal with.

Maybe in some future Oracle release...

Rule Based Optimizer

S, February 26, 2013 - 11:55 am UTC

Shouldn’t these kinds of transformations be a part of the rule based optimizer? Not a single reference to the size or distribution of data was made in Toon's example.

And, of course, the DBMS should allow you to specify constraints of arbitrary complexity.

S

Tom Kyte
February 26, 2013 - 12:42 pm UTC

the rule based optimizer did no translations really. It takes a hugely sophisticated rewrite to perform some of these.

On a simple query, it looks - so simple. On complex queries (what we really have to deal with) it becomes very complex - not so clear.


the dbms allows you to specify column and row level constraints of any complexity - it just cannot use some of them to rewrite queries.

but on the other hand, it can and does use MANY of them to rewrite queries.

It just doesn't do every possible rewrite a human being sitting there looking at a query could do.

On the other hand, it does rewrites that seem like magic - ones that a developer having access to the constraints would not think to do (sometimes the software is smarter than the coder - oftentimes actually).


Here is a quick example of what I mean.

Did you know that the query:

ops$tkyte%ORA11GR2> SELECT COUNT(*)
  2    FROM T1, T2, T3
  3   WHERE T2.order_id = T1.order_id
  4     AND T2.service_order_id = T3.service_order_id (+)
  5     AND T3.related_service_order_id = TO_NUMBER(:v0);


can in some cases be rewritten simply as:


ops$tkyte%ORA11GR2> SELECT COUNT(*)
  2    FROM T3
  3   WHERE T3.related_service_order_id = TO_NUMBER(:v0);

that is, a three table join could be rewritten in this case as a single table query.  A query plan goes from access three tables, three indexes - to a single index range scan.

All because of constraints being in place.

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:4043033500346640019





so, over time, more and more and more transformations are added - one day (as I said in my response) - they may do this rewrite for us...


Query Rewrite

David Aldridge, March 04, 2013 - 12:34 pm UTC

I'm a bit rusty on some of this, but I would try using DBMS_Advanced_Rewrite to declare equivalence between the commonly-written-but-poorly-performing query and the transformed better performing query.

Adding paid_date >= first_date_of_service onto queries for which detail_status_code = 'P' and active_ind = 'Y' might be sufficient for the optimiser to prune on paid_date.
Tom Kyte
March 04, 2013 - 4:41 pm UTC

I don't think that would apply here - since the queries seem to be ad-hoc, you don't know what queries to equivalance to..

David Aldridge, March 05, 2013 - 2:06 am UTC

Ad-hoc queries ought to be OK. The declared mode of the rewrite equivalence would have to be either 'general' or 'recursive' rather than 'text_match', but 'text_match' is really just a very specific restriction that people might use only for optimising a single troublesome query.

Don't forget that this is the same rewrite engine as is used for materialised views, where you might be creating an MV such as "select month,sum(sales) from fact_sales group by month" and expecting rewrite for "select month,sum(sales) from fact_sales where month = date'2012-01-01' group by month".

Tom Kyte
March 05, 2013 - 11:49 am UTC

I'm thinking "in general", maybe if everything is a single table query, but I'm supposing the queries are in general more complex than that.

Some of these things allready happen

lh, April 19, 2013 - 3:12 pm UTC

Hi,

Just to point out, that some of this is allready happening in Oracle 11.2.0.3.


DROP TABLE CLAIMS_ANALYSIS;
create table claimS_analysis (paid_date date not null , first_date_of_service date , active_ind varchar2(1 char) , detail_status_code varchar2(1 char) )
tablespace track1
pctfree 0
partition BY RANGE (paid_date)
INTERVAL( NUMTODSINTERVAL(1,'DAY'))
(
PARTITION NOTHING VALUES LESS THAN (TO_DATE(' 2012-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
) ;

declare
w date := to_date('01012013','ddmmyyyy');
begin
for i in 1..30 loop
for j in i..30 loop
for k in 1..100 loop
insert into claimS_analysis(paid_date, first_date_of_service, active_ind, detail_status_code) values
(W + i+j, W+i, 'Y','P');
-- insert into claiMS_analysis(paid_date, first_date_of_service, active_ind, detail_status_code) values
-- (W + i, W+i+j, 'X','Y');
-- insert into claimS_analysis(paid_date, first_date_of_service, active_ind, detail_status_code) values
-- (W + i, W+i+j, 'Z','X');
end loop;
end loop;
end loop;
commit;
end;
/
execute dbms_stats.gather_table_stats(user,'claimS_analysis');
explain plan for
select
count(*)
from
claimS_analysis
where
active_ind = 'Y' and
detail_status_code = 'P' and
first_date_of_service between to_date('20130105','YYYYMMDD') and
to_date('20130110','YYYYMMDD');
select * from table(dbms_xplan.display);

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 12 | 739 (1)| 00:00:01 | | |
| 1 | SORT AGGREGATE | | 1 | 12 | | | | |
| 2 | PARTITION RANGE ALL| | 11117 | 130K| 739 (1)| 00:00:01 | 1 |1048575|
|* 3 | TABLE ACCESS FULL | CLAIMS_ANALYSIS | 11117 | 130K| 739 (1)| 00:00:01 | 1 |1048575|
--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - filter("FIRST_DATE_OF_SERVICE"<=TO_DATE(' 2013-01-10 00:00:00', 'syyyy-mm-dd
hh24:mi:ss') AND "FIRST_DATE_OF_SERVICE">=TO_DATE(' 2013-01-05 00:00:00', 'syyyy-mm-dd
hh24:mi:ss') AND "ACTIVE_IND"='Y' AND "DETAIL_STATUS_CODE"='P')

Partition pruning doesn't occur.

Add constraint:


alter table claims_analysis add CONSTRAINT WRONG_PRUNING_CHECK CHECK
(paid_date >= first_date_of_service and DETAIL_STATUS_CODE='P' and ACTIVE_IND='Y')
/

explain plan for
select
count(*)
from
claimS_analysis
where
active_ind = 'Y' and
detail_status_code = 'P' and
first_date_of_service between to_date('20130105','YYYYMMDD') and
to_date('20130110','YYYYMMDD');
select * from table(dbms_xplan.display);

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
-------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 20 | 714 (1)| 00:00:01 | | |
| 1 | SORT AGGREGATE | | 1 | 20 | | | | |
| 2 | PARTITION RANGE ITERATOR| | 11117 | 217K| 714 (1)| 00:00:01 | 372 |1048575|
|* 3 | TABLE ACCESS FULL | CLAIMS_ANALYSIS | 11117 | 217K| 714 (1)| 00:00:01 | 372 |1048575|
-------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - filter("FIRST_DATE_OF_SERVICE"<=TO_DATE(' 2013-01-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"FIRST_DATE_OF_SERVICE">=TO_DATE(' 2013-01-05 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"ACTIVE_IND"='Y' AND "DETAIL_STATUS_CODE"='P' AND "PAID_DATE">=TO_DATE(' 2013-01-05 00:00:00',
'syyyy-mm-dd hh24:mi:ss'))

From Pstart we can see that partition pruning was done. Pruning doen't occur with correct check constraint:

alter table claims_analysis add CONSTRAINT PRUNING_CHECK CHECK
(detail_status_code != 'P' or active_ind != 'Y' or (paid_date >= first_date_of_service and DETAIL_STATUS_CODE='P' and ACTIVE_IND='Y'))
/



lh

More to Explore

VLDB

If you are new to partitioning, check out Connor McDonald's introduction series here.

VLDB

Documentation set on VLDB and Partitioning.