DBMS_STATS.AUTO_SAMPLE_SIZE seems to always generate 100%

Question and Answer

Connor McDonald

Thanks for the question, Travis.

Asked: March 21, 2017 - 8:05 pm UTC

Last updated: October 21, 2021 - 1:36 am UTC

Version: 12.1.0.2

Viewed 1000+ times

You Asked

Hello,

I was curious about something that I am running into. The following happens:

select dbms_stats.get_param('ESTIMATE_PERCENT') from dual;

AUTO_SAMPLE_SIZE

select * from dba_tab_statistics where sample_size != num_rows;

no rows selected

We have over 200 tables of varying sizes, and this database is about 7 TB in size, with tables containing as many as 600 million rows.

I am very curious why it appears that AUTO_SAMPLE_SIZE is using an ESTIMATE_PERCENT=>100 when gathering the stats for every single table.

Am I looking in the right place for this information?

and Connor said...

That is by design. It *sounds* bad, but it's actually pretty cool.

In earlier versions, the biggest cost in gathering stats was working out the number of distinct values per column - that's a huge sort for every column. But we managed to come up with a way of working out that value without sorting, just by scanning the table once from top to bottom.

Check this blog post

https://blogs.oracle.com/optimizer/entry/how_does_auto_sample_size

and check the 11g section at the bottom.
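
You can see this for yourself. A minimal sketch (the table name BIG_TABLE is hypothetical) - gather with AUTO_SAMPLE_SIZE and then look at the dictionary; sample_size comes back equal to num_rows, exactly as in your dba_tab_statistics query:

exec dbms_stats.gather_table_stats(user, 'BIG_TABLE', estimate_percent=>dbms_stats.auto_sample_size)

select table_name, num_rows, sample_size
from   user_tab_statistics
where  table_name = 'BIG_TABLE';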


Comments

Thanks, followup question

Travis Hunt, March 22, 2017 - 2:14 pm UTC

So is ESTIMATE_PERCENT=>AUTO_SAMPLE_SIZE exactly the same thing as manually defining ESTIMATE_PERCENT=>100?
Connor McDonald
March 23, 2017 - 3:32 am UTC

No!

estimate=>100 says "Use the old style algorithms, ie the mega sorts etc, and scan the whole table"

estimate=>auto says "Scan the whole table and use the smart new algorithm to avoid the sorts"
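
A hedged sketch of what those two settings translate to in a gather call (the table name is hypothetical) - both calls read every block, but only the first uses the new one-pass algorithm:

exec dbms_stats.gather_table_stats(user, 'BIG_TABLE', estimate_percent=>dbms_stats.auto_sample_size)

exec dbms_stats.gather_table_stats(user, 'BIG_TABLE', estimate_percent=>100)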


Follow-up Question

Dan, October 14, 2021 - 4:26 pm UTC

estimate=>100 says "Use the old style algorithms, ie the mega sorts etc, and scan the whole table"

estimate=>auto says "Scan the whole table and use the smart new algorithm to avoid the sorts"


Does that mean then that any estimate_percent value other than auto_sample_size (say 25%) will use the old style algorithms?

The rest of this is really just a comment (or perhaps a suggestion!).

I've noticed that since upgrading from 11g to 19c, the auto stats collection job never finishes without timing out - literally never. So I'm trying to find a rational approach to configuring automated stats gathering that lets it gather stats where needed, while facing ever-shrinking windows when the database is otherwise nearly idle, in a large data warehouse environment where "one size does not fit all". I'm looking at many different strategies, including dbms_stats preferences, adjusting the auto task windows, incremental stats on certain large partitioned tables, and any other helpful strategies I can find.

One thing that would be very helpful in our environment would be the flexibility to specify preferences at the tablespace level. We have a large tablespace that houses data staged to be loaded when our ETL processes run. We know with certainty that the ETL processes will read every row in those tables each time they are accessed (hence they don't really need statistics - full table scans should always be employed). However, we can't apply dbms_stats preferences at the table level for those objects, since the majority of them are dropped and re-created with each ETL cycle. If there were a way to configure the auto stats gathering job to simply ignore the objects in specific tablespaces, it would relieve much of the load on the auto stats job.

I know we could create a new schema just for the staged objects and apply dbms_stats preferences at the schema level, but at this point, that would require a significant amount of work for the application development team, and, of course, they are overloaded with work as it is. Same thing with the idea of setting dbms_stats preferences on those tables at creation time.

I have another question to ask about incremental statistics, specifically related to Nigel Bayliss' excellent blog on "Efficient Statistics Maintenance", but I'll ask that under separate cover as it's a bit off-topic for this discussion.

Thanks!
Connor McDonald
October 21, 2021 - 1:36 am UTC

Does that mean then that any estimate_percent value other than auto_sample_size (say 25%) will use the old style algorithms?

Pretty much. The percentage ends up being pushed down into SAMPLE clauses on the queries that are run to deduce the data distribution. And don't forget that sampling (say) 2% means (by default) 2% of the rows. If you have 50 rows per block, you might well still visit every block.
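
To illustrate that last point, a rough sketch against a hypothetical heap table called BIG_TABLE (not the actual recursive SQL the gather runs): the SAMPLE clause picks rows, so even a small percentage can still touch most of the blocks, whereas SAMPLE BLOCK picks whole blocks:

select count(*) from big_table sample (2);        -- roughly 2% of the rows
select count(*) from big_table sample block (2);  -- roughly 2% of the blocks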

re: the other comments

Whilst we don't have a tablespace-level filter, you could easily set up your own with something along these lines:

begin
  for i in ( select distinct owner, segment_name from dba_segments
             where  tablespace_name = '...' and segment_type like 'TABLE%' ) loop
    dbms_stats.lock_table_stats(i.owner, i.segment_name);
  end loop;
end;


and then explicitly control the stats on those tables manually.
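
And when you do want stats on one of those locked tables, you can override the lock for that single call with force=>true. A sketch (the schema and table names are hypothetical):

begin
  dbms_stats.gather_table_stats(
     ownname => 'STAGE_OWNER',   -- hypothetical staging schema
     tabname => 'STG_SALES',     -- hypothetical staging table
     force   => true );
end;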
