Time Dimension - Cardinality querying on correlated columns

Question and Answer

Connor McDonald

Thanks for the question, James.

Asked: December 05, 2024 - 6:04 am UTC

Last updated: December 09, 2024 - 12:09 pm UTC

Version: 19.13

Viewed 1000+ times

You Asked

We have nagging plan stability issues which are often traced back to bad cardinality estimates from our time dimension table. There are numerous workarounds (hinting, baselines, profiles, etc) but it would be so much easier if the CBO just got it right.

I think our time dimension table is typical. The most granular data is hourly: the table contains one row for every hour (the key column) and has other columns for the day/month/year that the hour rolls up to. This example is simplified but illustrates the point.

create table tst_timetab (hour date, day date, month date, year date, primary key (hour) );            

insert into tst_timetab 
select day+offset/24 hour, trunc(day+offset/24, 'dd') day, trunc(day+offset/24, 'mm') month, trunc(day+offset/24, 'yy') year
from (select to_date('1-jan-2020', 'dd-mon-yyyy') day, (level-1) offset from dual connect by level<=5*365*24);
commit;

exec DBMS_STATS.GATHER_TABLE_STATS(sys_context('USERENV', 'CURRENT_USER'), 'tst_timetab', cascade=>true, method_opt => 'FOR ALL COLUMNS SIZE 255 FOR COLUMNS SIZE 255 (day,hour)', no_invalidate=>false   );

select * 
from tst_timetab 
where day = to_date('20-jan-2022', 'dd-mon-yyyy')
and hour >= to_date('20-jan-2022', 'dd-mon-yyyy') 
and hour < to_date('21-jan-2022', 'dd-mon-yyyy');


We would like to see this return a cardinality estimate greater than 1 (preferably 24).

DAY and HOUR are correlated predicates (hour determines day). For example, DAY and HOUR might both be specified as predicates because other tables in the join are partitioned on HOUR (thus requiring a bind to do partition elimination or other types of predicate pushdowns).
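A sketch of that pattern (fact_hourly here is a hypothetical fact table, assumed to be range-partitioned on its hour column):

-- Repeating the hour range against f.hour lets the optimizer prune
-- partitions on fact_hourly directly, instead of deriving the range
-- through the join to the time dimension.
select f.*
from   tst_timetab t
join   fact_hourly f on f.hour = t.hour
where  t.day  =  to_date('20-jan-2022', 'dd-mon-yyyy')
and    t.hour >= to_date('20-jan-2022', 'dd-mon-yyyy')
and    t.hour <  to_date('21-jan-2022', 'dd-mon-yyyy')
and    f.hour >= to_date('20-jan-2022', 'dd-mon-yyyy')
and    f.hour <  to_date('21-jan-2022', 'dd-mon-yyyy');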

We've tried multi-column extended stats and various histograms to no avail. We also looked into the CREATE DIMENSION statement, but we're not sure how it would help us here. Hundreds of queries are written in this manner, so changing to use a new table (or dimension) would have a lot of impact. A sketch of the extended stats attempt follows.
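Roughly what the extended stats attempt looked like (a sketch; as we understand it, column group stats are only used for equality predicates, so the range predicate on HOUR defeats them):

-- Create a (DAY, HOUR) column group, then regather so it gets stats.
select dbms_stats.create_extended_stats(user, 'TST_TIMETAB', '(DAY,HOUR)')
from   dual;

exec DBMS_STATS.GATHER_TABLE_STATS(user, 'tst_timetab', cascade=>true, method_opt => 'FOR ALL COLUMNS SIZE 255', no_invalidate=>false)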

LiveSQL with data and stats included. Many thanks.

https://livesql.oracle.com/ords/livesql/s/colpdvbgiquu7vu9hpw572un4

and Chris said...

Can you share an example where the estimates are bad?

I see reasonable estimates in the Live SQL (28 rows) and on 19c (27) & 23ai (24):

19.25:
--------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |      1 |        |     24 |00:00:00.01 |       3 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| TST_TIMETAB  |      1 |     27 |     24 |00:00:00.01 |       3 |
|*  2 |   INDEX RANGE SCAN                  | SYS_C0021725 |      1 |     27 |     24 |00:00:00.01 |       2 |
--------------------------------------------------------------------------------------------------------------


23.6:

--------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |      1 |        |     24 |00:00:00.01 |       3 |
|*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| TST_TIMETAB  |      1 |      1 |     24 |00:00:00.01 |       3 |
|*  2 |   INDEX RANGE SCAN                  | SYS_C0015578 |      1 |     24 |     24 |00:00:00.01 |       2 |
--------------------------------------------------------------------------------------------------------------


An example of a real plan you have with bad estimates may help here.

I think our time dimension table is typical

I have to disagree here. In a typical time dimension table, only the calendar day column is a date; the rest are numbers/strings (occasionally start/end of month/year may be dates too). Every column in your dimension is a date!

Hours would also be stored as integers rather than dates. Often these are in a separate time dimension table rather than a dates table.

I know it's a lot of work, but refactoring your schema to a more typical date dimension may be the best long term solution here.
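A minimal sketch of what I mean (all names hypothetical): the day key stays a date, the rollup parts become numbers/strings, and hours live in their own dimension.

create table date_dim (
  day_key   date primary key,  -- midnight of the calendar day
  day_name  varchar2(3),       -- 'MON' .. 'SUN'
  month_no  number(2),         -- 1 .. 12
  year_no   number(4)
);

create table time_dim (
  hour_of_day number(2) primary key  -- 0 .. 23
);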

Another question is why you're filtering on both day and hour. Given hour determines day, what's the point of filtering on both?

If the bad cardinality estimates are down to the optimizer not spotting the correlation between these values, only filtering on one of them seems a good place to start to me.
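In your example that means dropping the redundant DAY filter:

-- Same rows, one predicate: the range on HOUR already implies the day.
select *
from   tst_timetab
where  hour >= to_date('20-jan-2022', 'dd-mon-yyyy')
and    hour <  to_date('21-jan-2022', 'dd-mon-yyyy');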


Comments

Cardinality on Correlated Columns (follow-up)

James, December 06, 2024 - 2:03 am UTC

We get a 1-row estimate on 19.13. I tried gathering stats with both FOR ALL COLUMNS SIZE 255 and FOR ALL COLUMNS SIZE 255 FOR COLUMNS SIZE 255 (day,hour), but they seem to behave the same. In the predicate information you can see the DAY= filter is the problem.

RE: why we use both DAY and HOUR: in larger queries where the time table joins to both hourly and daily partitioned tables, we include hour= and day= as additional binds so the optimizer can use specific partition stats from those other tables. We also usually get a better plan because the optimizer knows immediately what range to use on each table, instead of figuring it out from the time table (which, as we see here, has issues).

RE: our time table design: all our data is time-block based and streams in from upstream systems/vendors with a date. I get the theory, and it's more space efficient, but having the date in the table directly has so many benefits. I'm not even sure that's the root of the problem here.

Thank you. Maybe this is fixed in 19.25!! That will be my Christmas present!

SQL> set echo on autotrace off lines 200 pages 0
Autotrace Disabled
SQL> select version_full from v$instance;
19.13.0.0.0

SQL> 
SQL> DROP TABLE tst_timetab;

Table TST_TIMETAB dropped.

SQL> create table tst_timetab (hour date, day date, month date, year date, primary key (hour) );

Table TST_TIMETAB created.

SQL> 
SQL> insert into tst_timetab 
  2  select day+offset/24 hour, trunc(day+offset/24, 'dd') day, trunc(day+offset/24, 'mm') month, trunc(day+offset/24, 'yy') year
  3  from (select to_date('1-jan-2020', 'dd-mon-yyyy') day, (level-1) offset from dual connect by level<=5*365*24);

43,800 rows inserted.

SQL> commit;

Commit complete.

SQL> 
SQL> exec DBMS_STATS.GATHER_TABLE_STATS(user, 'tst_timetab', cascade=>true, method_opt => 'FOR ALL COLUMNS SIZE 255', no_invalidate=>false   );

PL/SQL procedure successfully completed.

SQL> 
SQL> set autotrace on
Autotrace Enabled
Shows the execution plan as well as statistics of the statement.
SQL> select /*+gather_plan_statistics*/ * 
  2  from tst_timetab 
  3  where day = to_date('20-jan-2022', 'dd-mon-yyyy')
  4  and hour >= to_date('20-jan-2022', 'dd-mon-yyyy') 
  5  and hour < to_date('21-jan-2022', 'dd-mon-yyyy');
20-JAN-2022 00:00:00 20-JAN-2022 00:00:00 01-JAN-2022 00:00:00 01-JAN-2022 00:00:00
...
20-JAN-2022 23:00:00 20-JAN-2022 00:00:00 01-JAN-2022 00:00:00 01-JAN-2022 00:00:00

24 rows selected. 

SQL_ID  3r8b9v6bb717w, child number 0
-------------------------------------
select /*+gather_plan_statistics*/ *  from tst_timetab  where day = 
to_date('20-jan-2022', 'dd-mon-yyyy') and hour >= 
to_date('20-jan-2022', 'dd-mon-yyyy')  and hour < 
to_date('21-jan-2022', 'dd-mon-yyyy')
 
Plan hash value: 1518437623
 
---------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |               |      1 |        |     24 |00:00:00.01 |       3 |
|*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| TST_TIMETAB   |      1 |      1 |     24 |00:00:00.01 |       3 |
|*  2 |   INDEX RANGE SCAN                  | SYS_C00432714 |      1 |     25 |     24 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("DAY"=TO_DATE(' 2022-01-20 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   2 - access("HOUR">=TO_DATE(' 2022-01-20 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND "HOUR"<TO_DATE(' 
              2022-01-21 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
 

Statistics
-----------------------------------------------------------
               1  DB time
               5  Requests to/from client
           16384  cell physical IO interconnect bytes
               5  consistent gets
               3  consistent gets examination
               2  consistent gets examination (fastpath)
               5  consistent gets from cache
               2  consistent gets pin
               2  consistent gets pin (fastpath)
               5  db block gets
               5  db block gets from cache
               4  db block gets from cache (fastpath)
               3  enqueue releases
               3  enqueue requests
               1  messages sent
               8  non-idle wait count
               4  opened cursors cumulative
               1  opened cursors current
               1  physical read total IO requests
           16384  physical read total bytes
               1  pinned cursors current
               3  recursive calls
              10  session logical reads
               6  user calls
SQL> 
SQL> exec DBMS_STATS.GATHER_TABLE_STATS(user, 'tst_timetab', cascade=>true, method_opt => 'FOR ALL COLUMNS SIZE 255 FOR COLUMNS SIZE 255 (day,hour)', no_invalidate=>false   );

PL/SQL procedure successfully completed.

SQL> 
SQL> select /*+gather_plan_statistics*/ * 
  2  from tst_timetab 
  3  where day = to_date('20-jan-2022', 'dd-mon-yyyy')
  4  and hour >= to_date('20-jan-2022', 'dd-mon-yyyy') 
  5  and hour < to_date('21-jan-2022', 'dd-mon-yyyy');
20-JAN-2022 00:00:00 20-JAN-2022 00:00:00 01-JAN-2022 00:00:00 01-JAN-2022 00:00:00
...
20-JAN-2022 23:00:00 20-JAN-2022 00:00:00 01-JAN-2022 00:00:00 01-JAN-2022 00:00:00

24 rows selected. 

SQL_ID  3r8b9v6bb717w, child number 0
-------------------------------------
select /*+gather_plan_statistics*/ *  from tst_timetab  where day = 
to_date('20-jan-2022', 'dd-mon-yyyy') and hour >= 
to_date('20-jan-2022', 'dd-mon-yyyy')  and hour < 
to_date('21-jan-2022', 'dd-mon-yyyy')
 
Plan hash value: 1518437623
 
---------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |               |      1 |        |     24 |00:00:00.01 |       3 |
|*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| TST_TIMETAB   |      1 |      1 |     24 |00:00:00.01 |       3 |
|*  2 |   INDEX RANGE SCAN                  | SYS_C00432714 |      1 |     19 |     24 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("DAY"=TO_DATE(' 2022-01-20 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   2 - access("HOUR">=TO_DATE(' 2022-01-20 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND "HOUR"<TO_DATE(' 
              2022-01-21 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
 

Statistics
-----------------------------------------------------------
               5  Requests to/from client
               5  consistent gets
               1  consistent gets examination
               1  consistent gets examination (fastpath)
               5  consistent gets from cache
               4  consistent gets pin
               4  consistent gets pin (fastpath)
               1  enqueue releases
               1  enqueue requests
               5  non-idle wait count
               3  opened cursors cumulative
               1  opened cursors current
               1  pinned cursors current
               2  recursive calls
               5  session logical reads
               6  user calls


Chris Saxon
December 09, 2024 - 12:09 pm UTC

I'm still a bit lost as to what the actual problem is here. The estimates in these examples are all reasonably accurate.

Also, we usually get a better plan because the optimizer knows immediately what range to use on each table

So what exactly is causing you to get "bad" plans?

I get the theory and it's more space efficient but having the date in the table directly has so many benefits.

What are the benefits you're getting?

To be clear - you'd usually still have a date column. You'd also have columns for the parts of the date you're interested in. Doing this isn't really about space saving. It's that it makes queries comparing part of a date easy, e.g.:

Compare December this year to last year (month = 12)
Show all activity at weekends (day in ( 'SAT', 'SUN' ))
What was the activity at 10am every day this week (hour = 10)

If you store hours/months/years/etc. as dates, you need to extract out the bit you're interested in. Whereas if they're in their own columns you can compare them directly as in the examples above.
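For example, contrast extracting the hour from a date with comparing a dedicated column (hour_of_day is hypothetical):

-- Hours stored as dates force you to extract the part you want:
select count(*) from tst_timetab
where  to_char(hour, 'hh24') = '10';

-- With a dedicated column the comparison is direct, and the optimizer
-- can use that column's stats:
-- select count(*) from some_fact where hour_of_day = 10;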

using Dynamic sampling level = Auto

Rajeshwaran, Jeyabal, December 09, 2024 - 3:55 pm UTC

I think the issue here is that the optimizer loses the right estimate when it hits the table segment using the rowids returned from the index range scan.

demo@ATP23ai> select * from table( dbms_xplan.display_cursor(format=>'allstats last'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  05wru5mgg6kyw, child number 1
-------------------------------------
select * from tst_timetab where day = to_date('20-jan-2022',
'dd-mon-yyyy') and hour >= to_date('20-jan-2022', 'dd-mon-yyyy') and
hour < to_date('21-jan-2022', 'dd-mon-yyyy')

Plan hash value: 882017390

--------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |      1 |        |     24 |00:00:00.01 |       3 |
|*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| TST_TIMETAB  |      1 |      1 |     24 |00:00:00.01 |       3 |
|*  2 |   INDEX RANGE SCAN                  | SYS_C0024967 |      1 |     29 |     24 |00:00:00.01 |       2 |
--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("DAY"=TO_DATE(' 2022-01-20 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   2 - access("HOUR">=TO_DATE(' 2022-01-20 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND "HOUR"<TO_DATE('
              2022-01-21 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))


23 rows selected.


In the above plan, the index range scan returned 24 rowids, which is close to the estimated rows. But at the "table access by index rowid" step, the estimate (1 row) is way off from the actual (24 rows) - hence the problem.

One way to address this is dynamic sampling at level 11 (automatic dynamic sampling):

demo@ATP23ai> set autotrace traceonly exp
demo@ATP23ai> select /*+ dynamic_sampling(11) */ *
  2  from tst_timetab
  3  where day = to_date('20-jan-2022', 'dd-mon-yyyy')
  4  and hour >= to_date('20-jan-2022', 'dd-mon-yyyy')
  5  and hour < to_date('21-jan-2022', 'dd-mon-yyyy');

Execution Plan
----------------------------------------------------------
Plan hash value: 882017390

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |    24 |   768 |     3   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| TST_TIMETAB  |    24 |   768 |     3   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | SYS_C0024967 |    24 |       |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("DAY"=TO_DATE(' 2022-01-20 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   2 - access("HOUR">=TO_DATE(' 2022-01-20 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "HOUR"<TO_DATE(' 2022-01-21 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
