You Asked

We have a column with varchar2(2000) and the column is indexed. But first few characters can be same for multiple rows. but when the first few string length goes it narrows the numbers of hits.

Our data is like a nested directory structure.

I reading an article about this

http://www.bwong64.com/oracles-32-character-limit-in-generating-histograms/

"if you have a column that is wider than 32 characters AND its values are longer than 32 characters, Oracle will only consider the first 32 characters in the column in generating histograms. In other words, if the first 32 characters don’t vary much in all the rows or it’s identical in all rows, then the histograms generated or the absence of them can be way off. As a consequence, the optimizer will make wrong decisions, doing range scans when it shouldn’t, thinking that more rows will match the criteria than there should be."

One of the option is reverse key index but that is not effective in case of range scan ...

Could you please explain and provide your valuable thoughts and solutions

Thanks
Rama

and Tom said...

see
http://www.oracle.com/technetwork/issue-archive/2011/11-jan/o11asktom-195084.html

for more info.

In there I wrote:

<quote>

You should be aware of this histogram implementation, however, because it could prompt you to store some data attributes differently. For example, suppose you have a system that stores URLs in a table—an audit trail, for example. If you look at a URL such as the ones you see on asktom.oracle.com, you’ll discover that the first 39 characters of my URL are constant—

http://asktom.oracle.com/pls/asktom/f?p

—regardless of what page you end up on in my site. Having a histogram on that data would not be very useful.

However, what if I stored the data in a more meaningful way, such as storing the URL in a series of fields—specifically, using the fields PROTOCOL, HOST, PORT, and PATH? In this case, the PROTOCOL field would contain values such as HTTP or HTTPS, the HOST field would store asktom .oracle.com (and other host names), the PORT field would contain one of the standard port numbers used by typical HTTP, and the PATH field would be the unique bit. You have the ability to render the original URL in its entirety, so you haven’t lost any data. And you have the new ability to provide even better searching capabilities, because you have the components all split out. You also gain the ability to compress this data in the database very easily. (Oracle’s basic and advanced compression for tables works by removing repetitive attributes in a block. PROTOCOL, HOST, and PORT values would all repeat often.) And you would now have an attribute (PATH) that would be able to support a histogram nicely.

This is just another case where knowing how things work enables you to implement a solution more efficiently or at least justify why you have chosen to do something in a particular fashion.

</quote>

so, for example, if were storing file names - I would probably store the path in one attribute and the file name in another. That would give me two search fields - each with their own 32 character histogram. I might even go a step further and store "mount point", "path", "filename", "extension" - since mount point (the parent directory path) would repeat ALOT, path less often, filename less so, and so on. Giving us more attributes to gather statistics on.

Rating

(2 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

32 Character Limit in Generating Histograms

Rama Venkataraman, April 11, 2012 - 3:12 pm UTC

Thanks for your quick response. It was very helpful and we could validate our understanding too.

Thanks
Rama

Multibyte?

Dave Rabone, April 11, 2012 - 6:26 pm UTC

I seem to recall reading that the histogram width was actually 32 bytes, not 32 characters, so that the situation is worse when we have a multibyte character set. Is that still true in most recent versions?

I'd be tempted to split protocol, host, port and virtual directory out into another table and use a surrogate key.
Space saving without compression, and an easy update when the site moves.

April 11, 2012 - 6:34 pm UTC

correct, from the same linked to article:

<quote>
That aside, yes, you are correct that for varchar2 data, only so many bytes of information are stored in the histogram for each of the 254 values. That is, the leading edge of the varchar2 string, not necessarily the entire string, is stored in the histogram. I stress the word bytes here because in many cases, bytes are not the same as characters in a varchar2 string. If you are using a multibyte character set, it is quite possible that 32 bytes is able to store far fewer than 32 characters.
</quote>

I should have been more precise in my answer above - avoiding the word characters.

I'd be tempted to split protocol, host, port and virtual directory out into
another table and use a surrogate key.
Space saving without compression, and an easy update when the site moves.

the example was an audit trail.

There are no updates to an audit trail.

But yes, there could be something to be said for moving the protocol/host/port - and maybe the directory - out to a parent table, at the cost of making it much more less efficient to insert into the audit trail and less efficient to query.

Better at space perhaps, not as good at insert rate or query recall.

More to Explore

Security

All of the vital components for a secure database are covered in the Security guide.

Breadcrumb

Question and Answer