You Asked
Hello, and thank you for all you do. AskTom is my go-to site when seeking “off the beaten path” information.
My question is related to the column usage data used to identify columns as candidates for histograms when collecting statistics (stored in SYS.COL_USAGE$). This may be a good question to forward to Nigel Bayliss and his team.
A bit of background:
We maintain a data warehouse environment, and the ETL processes we employ drop many tables and then use CTAS to re-create them from the source database on a weekly basis. That ETL method presents a problem, as the column usage metadata is deleted when the tables are dropped, leaving behind no history of column usage for DBMS_STATS to use when statistics are gathered on the “new versions” of the tables.
The drop-and-replace method is used only for smaller tables (“smaller” by data warehouse standards, which might be considered large tables in other environments – many of our smaller tables contain tens of millions of rows). The tables are queried heavily both by routinely scheduled processes and by ad-hoc queries.
So, here is what happens to a typical table: During the weekly ETL cycle, the table is dropped, and with it goes the column usage metadata that has been collected since the previous ETL cycle. The table is then re-created via CTAS, and statistics are gathered. However, since the table has been freshly re-created, there is no column usage data available for DBMS_STATS to use in identifying the columns in that table that are typically used in joins or query predicates. New column usage data is collected as processes query the table during the week. But since there are no inserts, updates, or deletes applied to the table over the course of the week, the statistics never go stale – therefore statistics aren’t gathered on the table again, so the column usage data is never employed to identify histogram-worthy columns during stats gathering. So, histograms that may be beneficial to the query optimizer are never created on a lot of columns in a lot of tables. It’s a catch-22.
Okay, let’s get to the question:
Is there a way to capture just column usage data prior to dropping tables so that it can be re-inserted into the data dictionary after a table has been re-created?
A simple answer would be to gather stats more frequently, but that’s not really an option. We’re talking about thousands of tables, and the auto stats job already has a hard time keeping up. There simply isn’t enough time and resources to gather stats more frequently in our environment.
I’ve done some testing with DBMS_STATS.EXPORT_TABLE_STATS and IMPORT_TABLE_STATS and the results were not what I was hoping for. I created a smaller test version of one of our large production tables by choosing just a few of the partitions, exported the statistics from the production version, and imported those statistics into the test schema. I then used dbms_stats.report_col_usage to check column usage statistics for both tables. For the production table, report_col_usage reported 419 columns and combinations of columns. For the test version, report_col_usage reported only 4 columns (which were captured from test queries that I ran against the test table). So, exporting and importing statistics doesn’t appear to be a solution either, as the column_usage metadata does not appear to be captured. And even had I gotten the hoped-for results, I’m not sure it would be a viable solution anyway, as table structures are often modified between one ETL cycle and the next (via changes to the CTAS statements used in the ETL).
I guess it might be possible to cobble together SQL to capture the column usage data prior to an ETL cycle and then re-build and insert it back into SYS.COL_USAGE$ after the ETL cycle and before statistics are gathered. That would certainly require some re-mapping (because COL_USAGE$ stores OBJ# and INTCOL# values rather than owner, object_name, and column_name values). However, altering data in the data dictionary is probably not advisable (and I’m pretty sure Oracle Support would agree with that).
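For reference, the re-mapping described above can at least be done read-only without touching the dictionary. The following is a hedged sketch (schema and table names are hypothetical, and it assumes SELECT access to the SYS tables) that resolves OBJ# and INTCOL# back to owner, table, and column names; INTCOL# is resolved via SYS.COL$ because it can differ from COLUMN_ID when hidden or virtual columns exist:

```sql
-- Read-only mapping of SYS.COL_USAGE$ rows back to readable names.
select o.owner, o.object_name, c.name as column_name,
       u.equality_preds, u.equijoin_preds, u.nonequijoin_preds,
       u.range_preds, u.like_preds, u.null_preds, u.timestamp
from   sys.col_usage$ u
       join dba_objects o on o.object_id = u.obj#
                         and o.object_type = 'TABLE'
       join sys.col$    c on c.obj#    = u.obj#
                         and c.intcol# = u.intcol#
where  o.owner = 'ETL_OWNER'           -- hypothetical schema
and    o.object_name = 'SOME_TABLE';   -- hypothetical table
```

Capturing this output before the drop is safe; re-inserting it afterwards is the unsupported part.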
So, I’m seeking advice on how we might remedy our situation with a solution that would provide DBMS_STATS with the information it needs to identify columns which are candidates for histograms in SYS.COL_USAGE$. And, given the limited amount of time we are given to execute ETL cycles, the solution must be fast! Any input you can provide would be very much appreciated.
*** ADDED THE TEXT BELOW AFTER THE ORIGINAL QUESTION WAS SUBMITTED ***
A day or two after submitting this question, while reading Nigel Bayliss' blog post "Efficient Statistics Maintenance for Partitioned Tables Using Incremental Statistics", I learned that changes to column usage can trigger the database to re-gather statistics for a table when using 'SIZE AUTO' for the METHOD_OPT preference (which we are). That may be related to the fact that automatic statistics gathering struggles to keep up in our production environment (in fact, it rarely finishes the work it has queued within any given maintenance window). As each week between ETL cycles goes by, more and more differing queries are submitted against the tables, and different columns are used as predicates. So it could be that statistics are being automatically gathered on the tables throughout the week as a result of changes to column usage, even though the data itself is static. If that is true, then it becomes even more important for us to find a way to retain the column usage data across ETL cycles (or change METHOD_OPT to something like "for all columns size 1", which isn't a particularly attractive option either). I'm going to work on trying to prove or disprove that theory. The plot thickens!
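One way to test that theory, as a sketch (schema and table names are hypothetical): DBA_TAB_STATS_HISTORY records each time statistics were saved for a table, subject to the stats history retention period (31 days by default). Multiple rows between ETL cycles for a table whose data is static would suggest col-usage-driven re-gathers:

```sql
-- List every stats gather recorded for a table within the retention window.
select table_name, stats_update_time
from   dba_tab_stats_history
where  owner = 'ETL_OWNER'            -- hypothetical schema
and    table_name = 'SOME_TABLE'      -- hypothetical table
order  by stats_update_time;
```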
Thank you!
and Connor said...
As you've surmised, hacking COL_USAGE$ would probably work but also puts you firmly into the realm of unsupported :-)
Is the following strategy an option:
- you've gathered stats on the "version 1" of the table (after user queries etc) so you have histograms etc based on col usage.
- before you drop, you export stats
- then drop/recreate, thus creating "version 2" of the table
- now import stats (these are wrong because they reflect a different version of the data)
- you gather using "repeat" to reflect the previous *consequences* of the col_usage data.
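The steps above can be condensed into a script like this (the table name is hypothetical, and it assumes a stats table ST has already been created with dbms_stats.create_stat_table):

```sql
-- 1. Before the ETL drop: export stats from "version 1" of the table.
exec dbms_stats.export_table_stats('','SOME_TABLE',stattab=>'ST',statown=>user);

-- 2. The ETL drops and re-creates the table via CTAS ("version 2").

-- 3. Import the (now stale) stats to restore the histogram "shape".
exec dbms_stats.import_table_stats('','SOME_TABLE',stattab=>'ST',statown=>user);

-- 4. Re-gather with SIZE REPEAT: histograms land on the same columns,
--    but the values are computed from the new data.
exec dbms_stats.gather_table_stats('','SOME_TABLE', -
       method_opt=>'for all columns size repeat');
```

Note that step 3 assumes the table structure is unchanged between cycles; as mentioned in the question, a CTAS that alters the column list would complicate the import.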
Example:
SQL> create table t as select * from dba_objects;
Table created.
SQL> create index ix1 on t ( object_id );
Index created.
SQL> create index ix2 on t ( owner );
Index created.
--
-- some fakery to populate col_usage$
--
SQL> select count(*) from t where object_id = 1;
COUNT(*)
----------
0
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 2;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 3;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 4;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 5;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 6;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 7;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 8;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 9;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id = 10;
COUNT(*)
----------
1
SQL>
SQL> select count(*) from t where object_id <= 1;
COUNT(*)
----------
0
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 2;
COUNT(*)
----------
1
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 3;
COUNT(*)
----------
2
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 4;
COUNT(*)
----------
3
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 5;
COUNT(*)
----------
4
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 6;
COUNT(*)
----------
5
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 7;
COUNT(*)
----------
6
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 8;
COUNT(*)
----------
7
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 9;
COUNT(*)
----------
8
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id <= 10;
COUNT(*)
----------
9
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select count(*) from t where object_id is null;
COUNT(*)
----------
10
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'SYS';
MAX(OBJECT_ID)
--------------
979220
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'PUBLIC';
MAX(OBJECT_ID)
--------------
203656
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner >= 'OUTLN';
MAX(OBJECT_ID)
--------------
979220
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'SYSTEM';
MAX(OBJECT_ID)
--------------
163303
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'GSMADMIN_INTERNAL';
MAX(OBJECT_ID)
--------------
159799
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner <= 'DBSFWUSER';
MAX(OBJECT_ID)
--------------
903936
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'AUDSYS';
MAX(OBJECT_ID)
--------------
903936
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'ORACLE_OCM';
MAX(OBJECT_ID)
--------------
18377
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner != 'DBSNMP';
MAX(OBJECT_ID)
--------------
994458
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'APPQOSSYS';
MAX(OBJECT_ID)
--------------
20575
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'REMOTE_SCHEDULER_AGENT';
MAX(OBJECT_ID)
--------------
20637
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner = 'XDB';
MAX(OBJECT_ID)
--------------
147576
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
SQL> select max(object_id) from t where owner in ('WMSYS','OJVMSYS');
MAX(OBJECT_ID)
--------------
203615
SQL>
SQL> select object_id from user_objects where object_name = 'T';
OBJECT_ID
----------
998272
SQL> exec dbms_stats.flush_database_monitoring_info
PL/SQL procedure successfully completed.
--
-- make sure col_usage$ is good to go
--
SQL> select * from sys.col_usage$
2 where obj# = ( select object_id from user_objects where object_name = 'T' )
3 @pr
==============================
OBJ# : 998272
INTCOL# : 1
EQUALITY_PREDS : 11
EQUIJOIN_PREDS : 0
NONEQUIJOIN_PREDS : 0
RANGE_PREDS : 2
LIKE_PREDS : 0
NULL_PREDS : 0
TIMESTAMP : 04-NOV-21
FLAGS : 521
==============================
OBJ# : 998272
INTCOL# : 4
EQUALITY_PREDS : 10
EQUIJOIN_PREDS : 0
NONEQUIJOIN_PREDS : 0
RANGE_PREDS : 10
LIKE_PREDS : 0
NULL_PREDS : 1
TIMESTAMP : 04-NOV-21
FLAGS : 553
PL/SQL procedure successfully completed.
--
-- gather our stats and check for histograms etc
--
SQL> exec dbms_stats.gather_table_stats('','T');
PL/SQL procedure successfully completed.
SQL> select COLUMN_NAME,NUM_DISTINCT,
2 ( select decode(count(*),0,to_number(null),count(*)) from all_tab_histograms
3 where owner = all_tab_cols.owner
4 and table_name = all_tab_cols.table_name
5 and column_name = all_tab_cols.column_name ) hist_cnt
6 from all_tab_cols
7 where table_name = 'T'
8 and owner = user
9 order by COLUMN_ID;
COLUMN_NAME NUM_DISTINCT HIST_CNT
------------------------------ ------------ ----------
OWNER 47 47
OBJECT_NAME 66704 2
SUBOBJECT_NAME 684 2
OBJECT_ID 80158 254
DATA_OBJECT_ID 9610 2
OBJECT_TYPE 52 2
CREATED 2084 2
LAST_DDL_TIME 2392 2
TIMESTAMP 2302 2
STATUS 2 2
TEMPORARY 2 2
GENERATED 2 2
SECONDARY 2 2
NAMESPACE 25 2
EDITION_NAME 0
SHARING 4 2
EDITIONABLE 2 2
ORACLE_MAINTAINED 2 2
APPLICATION 1 2
DEFAULT_COLLATION 1 2
DUPLICATED 1 2
SHARDED 1 2
CREATED_APPID 0
CREATED_VSNID 0
MODIFIED_APPID 0
MODIFIED_VSNID 0
26 rows selected.
--
-- export those stats, not really for the data itself, but to get the "style" of the data
-- ie, which cols have histograms etc
--
SQL> exec dbms_stats.export_table_stats('','T',stattab=>'ST',statown=>user);
PL/SQL procedure successfully completed.
--
-- here comes version 2 of the table, it will be smaller (only 50k rows)
--
SQL> drop table t purge;
Table dropped.
SQL>
SQL> create table t as select * from dba_objects where rownum <= 50000;
Table created.
SQL> create index ix1 on t ( object_id );
Index created.
SQL> create index ix2 on t ( owner );
Index created.
--
-- nothing in col_usage
--
SQL> select * from sys.col_usage$
2 where obj# = ( select object_id from user_objects where object_name = 'T' )
3 @pr
PL/SQL procedure successfully completed.
--
-- import the stats
--
SQL> exec dbms_stats.import_table_stats('','T',stattab=>'ST',statown=>user);
PL/SQL procedure successfully completed.
--
-- so the stats are there, but the values are wrong (because they reflect the previous version of T)
--
SQL> select COLUMN_NAME,NUM_DISTINCT,
2 ( select decode(count(*),0,to_number(null),count(*)) from all_tab_histograms
3 where owner = all_tab_cols.owner
4 and table_name = all_tab_cols.table_name
5 and column_name = all_tab_cols.column_name ) hist_cnt
6 from all_tab_cols
7 where table_name = 'T'
8 and owner = user
9 order by COLUMN_ID;
COLUMN_NAME NUM_DISTINCT HIST_CNT
------------------------------ ------------ ----------
OWNER 47 47
OBJECT_NAME 66704 2
SUBOBJECT_NAME 684 2
OBJECT_ID 80158 254
DATA_OBJECT_ID 9610 2
OBJECT_TYPE 52 2
CREATED 2084 2
LAST_DDL_TIME 2392 2
TIMESTAMP 2302 2
STATUS 2 2
TEMPORARY 2 2
GENERATED 2 2
SECONDARY 2 2
NAMESPACE 25 2
EDITION_NAME 0
SHARING 4 2
EDITIONABLE 2 2
ORACLE_MAINTAINED 2 2
APPLICATION 1 2
DEFAULT_COLLATION 1 2
DUPLICATED 1 2
SHARDED 1 2
CREATED_APPID 0
CREATED_VSNID 0
MODIFIED_APPID 0
MODIFIED_VSNID 0
26 rows selected.
--
-- but use the old data to drive the outcome of the new gather
--
SQL> exec dbms_stats.gather_table_stats('','T',method_opt=>'for all columns size REPEAT');
PL/SQL procedure successfully completed.
SQL> select COLUMN_NAME,NUM_DISTINCT,
2 ( select decode(count(*),0,to_number(null),count(*)) from all_tab_histograms
3 where owner = all_tab_cols.owner
4 and table_name = all_tab_cols.table_name
5 and column_name = all_tab_cols.column_name ) hist_cnt
6 from all_tab_cols
7 where table_name = 'T'
8 and owner = user
9 order by COLUMN_ID;
COLUMN_NAME NUM_DISTINCT HIST_CNT
------------------------------ ------------ ----------
OWNER 14 14
OBJECT_NAME 44244 2
SUBOBJECT_NAME 204 2
OBJECT_ID 50000 254
DATA_OBJECT_ID 4564 2
OBJECT_TYPE 42 2
CREATED 899 2
LAST_DDL_TIME 1201 2
TIMESTAMP 1057 2
STATUS 1 2
TEMPORARY 2 2
GENERATED 2 2
SECONDARY 1 2
NAMESPACE 19 2
EDITION_NAME 0
SHARING 4 2
EDITIONABLE 2 2
ORACLE_MAINTAINED 2 2
APPLICATION 1 2
DEFAULT_COLLATION 1 2
DUPLICATED 1 2
SHARDED 1 2
CREATED_APPID 0
CREATED_VSNID 0
MODIFIED_APPID 0
MODIFIED_VSNID 0
26 rows selected.
SQL>