You Asked
Hello, and thank you for all you do. AskTom is my go-to site when seeking “off the beaten path” information.
My question is related to the column usage data used to identify columns as candidates for histograms when collecting statistics (stored in SYS.COL_USAGE$). This may be a good question to forward to Nigel Bayliss and his team.
A bit of background:
We maintain a data warehouse environment, and the ETL processes we employ drop many tables and then use CTAS to re-create them from the source database on a weekly basis. That ETL method presents a problem, as the column usage metadata is deleted when the tables are dropped, leaving behind no history of column usage for DBMS_STATS to use when statistics are gathered on the “new versions” of the tables.
The drop-and-replace method is used only for smaller tables (“smaller” by data warehouse standards, which might be considered large tables in other environments – many of our smaller tables contain tens of millions of rows). The tables are queried heavily both by routinely scheduled processes and by ad-hoc queries.
So, here is what happens to a typical table: During the weekly ETL cycle, the table is dropped, and with it goes the column usage metadata that has been collected since the previous ETL cycle. The table is then re-created via CTAS, and statistics are gathered. However, since the table has been freshly re-created, there is no column usage data available for DBMS_STATS to use in identifying the columns in that table that are typically used in joins or query predicates. New column usage data is collected as processes query the table during the week. But since there are no inserts, updates, or deletes applied to the table over the course of the week, the statistics never go stale – therefore statistics aren’t gathered on the table again, so the column usage data is never employed to identify histogram-worthy columns during stats gathering. So, histograms that may be beneficial to the query optimizer are never created on a lot of columns in a lot of tables. It’s a catch-22.
Okay, let’s get to the question:
Is there a way to capture just column usage data prior to dropping tables so that it can be re-inserted into the data dictionary after a table has been re-created?
A simple answer would be to gather stats more frequently, but that’s not really an option. We’re talking about thousands of tables, and the auto stats job already has a hard time keeping up. There simply isn’t enough time and resources to gather stats more frequently in our environment.
I’ve done some testing with DBMS_STATS.EXPORT_TABLE_STATS and IMPORT_TABLE_STATS and the results were not what I was hoping for. I created a smaller test version of one of our large production tables by choosing just a few of the partitions, exported the statistics from the production version, and imported those statistics into the test schema. I then used dbms_stats.report_col_usage to check column usage statistics for both tables. For the production table, report_col_usage reported 419 columns and combinations of columns. For the test version, report_col_usage reported only 4 columns (which were captured from test queries that I ran against the test table). So, exporting and importing statistics doesn’t appear to be a solution either, as the column_usage metadata does not appear to be captured. And even had I gotten the hoped-for results, I’m not sure it would be a viable solution anyway, as table structures are often modified between one ETL cycle and the next (via changes to the CTAS statements used in the ETL).
I guess it might be possible to cobble together SQL to capture the column usage data prior to an ETL cycle and then re-build and insert it back into SYS.COL_USAGE$ after the ETL cycle and before statistics are gathered. That would certainly require some re-mapping (because COL_USAGE$ stores OBJ# and INTCOL# values rather than owner, object_name, and column_name values). However, altering data in the data dictionary is probably not advisable (and I’m pretty sure Oracle Support would agree with that).
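For reference, the re-mapping described above can at least be done read-only without touching the dictionary. The following is a hedged sketch (schema and table names are hypothetical, and it assumes SELECT access to the SYS tables) that resolves OBJ# and INTCOL# back to owner, table, and column names; INTCOL# is resolved via SYS.COL$ because it can differ from COLUMN_ID when hidden or virtual columns exist:

```sql
-- Read-only mapping of SYS.COL_USAGE$ rows back to readable names.
select o.owner, o.object_name, c.name as column_name,
       u.equality_preds, u.equijoin_preds, u.nonequijoin_preds,
       u.range_preds, u.like_preds, u.null_preds, u.timestamp
from   sys.col_usage$ u
       join dba_objects o on o.object_id = u.obj#
                         and o.object_type = 'TABLE'
       join sys.col$    c on c.obj#    = u.obj#
                         and c.intcol# = u.intcol#
where  o.owner = 'ETL_OWNER'           -- hypothetical schema
and    o.object_name = 'SOME_TABLE';   -- hypothetical table
```

Capturing this output before the drop is safe; re-inserting it afterwards is the unsupported part.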
So, I’m seeking advice on how we might remedy our situation with a solution that would provide DBMS_STATS with the information it needs to identify columns which are candidates for histograms in SYS.COL_USAGE$. And, given the limited amount of time we are given to execute ETL cycles, the solution must be fast! Any input you can provide would be very much appreciated.
*** ADDED THE TEXT BELOW AFTER THE ORIGINAL QUESTION WAS SUBMITTED ***
A day or two after submitting this question, while reading Nigel Bayliss' blog post "Efficient Statistics Maintenance for Partitioned Tables Using Incremental Statistics", I learned that changes to column usage can trigger the database to re-gather statistics for a table when using 'SIZE AUTO' for the METHOD_OPT preference (which we are). That may be related to the fact that automatic statistics gathering struggles to keep up in our production environment (in fact, it rarely finishes the work it has queued within any given maintenance window). As each week between ETL cycles goes by, more and more differing queries are submitted against the tables, and different columns are used as predicates. So it could be that statistics are being automatically gathered on the tables throughout the week as a result of changes to column usage, even though the data itself is static. If that is true, then it becomes even more important for us to find a way to retain the column usage data across ETL cycles (or change METHOD_OPT to something like "for all columns size 1", which isn't a particularly attractive option either). I'm going to work on trying to prove or disprove that theory. The plot thickens!
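One way to test that theory, as a sketch (schema and table names are hypothetical): DBA_TAB_STATS_HISTORY records each time statistics were saved for a table, subject to the stats history retention period (31 days by default). Multiple rows between ETL cycles for a table whose data is static would suggest col-usage-driven re-gathers:

```sql
-- List every stats gather recorded for a table within the retention window.
select table_name, stats_update_time
from   dba_tab_stats_history
where  owner = 'ETL_OWNER'            -- hypothetical schema
and    table_name = 'SOME_TABLE'      -- hypothetical table
order  by stats_update_time;
```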
Thank you!
and Connor said...
As you've surmised, hacking COL_USAGE$ would probably work but also puts you firmly into the realm of unsupported :-)
Is the following strategy an option:
- you've gathered stats on the "version 1" of the table (after user queries etc) so you have histograms etc based on col usage.
- before you drop, you export stats
- then drop/recreate, thus creating "version 2" of the table
- now import stats (these are wrong because they reflect a different version of the data)
- you gather using "repeat" to reflect the previous *consequences* of the col_usage data.
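The steps above can be condensed into a script like this (the table name is hypothetical, and it assumes a stats table ST has already been created with dbms_stats.create_stat_table):

```sql
-- 1. Before the ETL drop: export stats from "version 1" of the table.
exec dbms_stats.export_table_stats('','SOME_TABLE',stattab=>'ST',statown=>user);

-- 2. The ETL drops and re-creates the table via CTAS ("version 2").

-- 3. Import the (now stale) stats to restore the histogram "shape".
exec dbms_stats.import_table_stats('','SOME_TABLE',stattab=>'ST',statown=>user);

-- 4. Re-gather with SIZE REPEAT: histograms land on the same columns,
--    but the values are computed from the new data.
exec dbms_stats.gather_table_stats('','SOME_TABLE', -
       method_opt=>'for all columns size repeat');
```

Note that step 3 assumes the table structure is unchanged between cycles; as mentioned in the question, a CTAS that alters the column list would complicate the import.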
Example:
SQL> create table t as select * from dba_objects;
Table created.
SQL> create index ix1 on t ( object_id );
Index created.
SQL> create index ix2 on t ( owner );
Index created.
--
-- some fakery to populate col_usage$
--
SQL> select count(*) from t where object_id = 1;
COUNT(*)
----------
0
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 2;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 3;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 4;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 5;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 6;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 7;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 8;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 9;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 10;
COUNT(*)
----------
1
SQL>
SQL> select count(*) from t where object_id <= 1;
COUNT(*)
----------
0
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 2;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 3;
COUNT(*)
----------
2
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 4;
COUNT(*)
----------
3
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 5;
COUNT(*)
----------
4
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 6;
COUNT(*)
----------
5
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 7;
COUNT(*)
----------
6
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 8;
COUNT(*)
----------
7
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 9;
COUNT(*)
----------
8
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 10;
COUNT(*)
----------
9
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id is null;
COUNT(*)
----------
10
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'SYS';
MAX(OBJECT_ID)
--------------
979220
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'PUBLIC';
MAX(OBJECT_ID)
--------------
203656
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner >= 'OUTLN';
MAX(OBJECT_ID)
--------------
979220
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'SYSTEM';
MAX(OBJECT_ID)
--------------
163303
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'GSMADMIN_INTERNAL';
MAX(OBJECT_ID)
--------------
159799
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner <= 'DBSFWUSER';
MAX(OBJECT_ID)
--------------
903936
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'AUDSYS';
MAX(OBJECT_ID)
--------------
903936
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'ORACLE_OCM';
MAX(OBJECT_ID)
--------------
18377
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner != 'DBSNMP';
MAX(OBJECT_ID)
--------------
994458
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'APPQOSSYS';
MAX(OBJECT_ID)
--------------
20575
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'REMOTE_SCHEDULER_AGENT';
MAX(OBJECT_ID)
--------------
20637
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'XDB';
MAX(OBJECT_ID)
--------------
147576
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner in ('WMSYS','OJVMSYS');
MAX(OBJECT_ID)
--------------
203615
SQL>
SQL> select object_id from user_objects where object_name = 'T';
OBJECT_ID
----------
998272
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
--
-- make sure col_usage$ is good to go
--
SQL> select * from sys.col_usage$
2 where obj# = ( select object_id from user_objects where object_name = 'T' )
3 @pr
==============================
OBJ# : 998272
INTCOL# : 1
EQUALITY_PREDS : 11
EQUIJOIN_PREDS : 0
NONEQUIJOIN_PREDS : 0
RANGE_PREDS : 2
LIKE_PREDS : 0
NULL_PREDS : 0
TIMESTAMP : 04-NOV-21
FLAGS : 521
==============================
OBJ# : 998272
INTCOL# : 4
EQUALITY_PREDS : 10
EQUIJOIN_PREDS : 0
NONEQUIJOIN_PREDS : 0
RANGE_PREDS : 10
LIKE_PREDS : 0
NULL_PREDS : 1
TIMESTAMP : 04-NOV-21
FLAGS : 553
PL/SQL procedure successfully completed.
--
-- gather our stats and check for histograms etc
--
SQL> exec dbms_stats.gather_table_stats('','T');
PL/SQL procedure successfully completed.
SQL> select COLUMN_NAME,NUM_DISTINCT,
2 ( select decode(count(*),0,to_number(null),count(*)) from all_tab_histograms
3 where owner = all_tab_cols.owner
4 and table_name = all_tab_cols.table_name
5 and column_name = all_tab_cols.column_name ) hist_cnt
6 from all_tab_cols
7 where table_name = 'T'
8 and owner = user
9 order by COLUMN_ID;
COLUMN_NAME NUM_DISTINCT HIST_CNT
------------------------------ ------------ ----------
OWNER 47 47
OBJECT_NAME 66704 2
SUBOBJECT_NAME 684 2
OBJECT_ID 80158 254
DATA_OBJECT_ID 9610 2
OBJECT_TYPE 52 2
CREATED 2084 2
LAST_DDL_TIME 2392 2
TIMESTAMP 2302 2
STATUS 2 2
TEMPORARY 2 2
GENERATED 2 2
SECONDARY 2 2
NAMESPACE 25 2
EDITION_NAME 0
SHARING 4 2
EDITIONABLE 2 2
ORACLE_MAINTAINED 2 2
APPLICATION 1 2
DEFAULT_COLLATION 1 2
DUPLICATED 1 2
SHARDED 1 2
CREATED_APPID 0
CREATED_VSNID 0
MODIFIED_APPID 0
MODIFIED_VSNID 0
26 rows selected.
--
-- export those stats, not really for the data itself, but to get the "style" of the data
-- ie, which cols have histograms etc
--
SQL> exec dbms_stats.export_table_stats('','T',stattab=>'ST',statown=>user);
PL/SQL procedure successfully completed.
--
-- here comes version 2 of the table, it will be smaller (only 50k rows)
--
SQL> drop table t purge;
Table dropped.
SQL>
SQL> create table t as select * from dba_objects where rownum <= 50000;
Table created.
SQL> create index ix1 on t ( object_id );
Index created.
SQL> create index ix2 on t ( owner );
Index created.
--
-- nothing in col_usage
--
SQL> select * from sys.col_usage$
2 where obj# = ( select object_id from user_objects where object_name = 'T' )
3 @pr
PL/SQL procedure successfully completed.
--
-- import the stats
--
SQL> exec dbms_stats.import_table_stats('','T',stattab=>'ST',statown=>user);
PL/SQL procedure successfully completed.
--
-- so the stats are there, but the values are wrong (because they reflect the previous version of T)
--
SQL> select COLUMN_NAME,NUM_DISTINCT,
2 ( select decode(count(*),0,to_number(null),count(*)) from all_tab_histograms
3 where owner = all_tab_cols.owner
4 and table_name = all_tab_cols.table_name
5 and column_name = all_tab_cols.column_name ) hist_cnt
6 from all_tab_cols
7 where table_name = 'T'
8 and owner = user
9 order by COLUMN_ID;
COLUMN_NAME NUM_DISTINCT HIST_CNT
------------------------------ ------------ ----------
OWNER 47 47
OBJECT_NAME 66704 2
SUBOBJECT_NAME 684 2
OBJECT_ID 80158 254
DATA_OBJECT_ID 9610 2
OBJECT_TYPE 52 2
CREATED 2084 2
LAST_DDL_TIME 2392 2
TIMESTAMP 2302 2
STATUS 2 2
TEMPORARY 2 2
GENERATED 2 2
SECONDARY 2 2
NAMESPACE 25 2
EDITION_NAME 0
SHARING 4 2
EDITIONABLE 2 2
ORACLE_MAINTAINED 2 2
APPLICATION 1 2
DEFAULT_COLLATION 1 2
DUPLICATED 1 2
SHARDED 1 2
CREATED_APPID 0
CREATED_VSNID 0
MODIFIED_APPID 0
MODIFIED_VSNID 0
26 rows selected.
--
-- but use the old data to drive the outcome of the new gather
--
SQL> exec dbms_stats.gather_table_stats('','T',method_opt=>'for all columns size REPEAT');
PL/SQL procedure successfully completed.
SQL> select COLUMN_NAME,NUM_DISTINCT,
2 ( select decode(count(*),0,to_number(null),count(*)) from all_tab_histograms
3 where owner = all_tab_cols.owner
4 and table_name = all_tab_cols.table_name
5 and column_name = all_tab_cols.column_name ) hist_cnt
6 from all_tab_cols
7 where table_name = 'T'
8 and owner = user
9 order by COLUMN_ID;
COLUMN_NAME NUM_DISTINCT HIST_CNT
------------------------------ ------------ ----------
OWNER 14 14
OBJECT_NAME 44244 2
SUBOBJECT_NAME 204 2
OBJECT_ID 50000 254
DATA_OBJECT_ID 4564 2
OBJECT_TYPE 42 2
CREATED 899 2
LAST_DDL_TIME 1201 2
TIMESTAMP 1057 2
STATUS 1 2
TEMPORARY 2 2
GENERATED 2 2
SECONDARY 1 2
NAMESPACE 19 2
EDITION_NAME 0
SHARING 4 2
EDITIONABLE 2 2
ORACLE_MAINTAINED 2 2
APPLICATION 1 2
DEFAULT_COLLATION 1 2
DUPLICATED 1 2
SHARDED 1 2
CREATED_APPID 0
CREATED_VSNID 0
MODIFIED_APPID 0
MODIFIED_VSNID 0
26 rows selected.
SQL>