Reasons Not to Use Varchar2(255-4000) if Actual Length is Shorter

Question and Answer

Tom Kyte

Thanks for the question, Sean.

Asked: February 16, 2011 - 10:31 am UTC

Last updated: February 17, 2011 - 11:36 am UTC

Version: 11.2.0

Viewed 10K+ times

You Asked

Hi Tom, long time reader, first time poster.

I am interested in your opinion on the main reasons not to use varchar2(255) or greater when storing data, if you are not 100% sure of what may get inserted into that column over time. Basically, why not just use varchar2(4000) and avoid the old "ORA-12899: value too large for column" error?

Here is my list; I am curious about yours and the readers'.

1. Data integrity. Lots of people need to read the data dictionary to build applications and to store data, and truth in advertising here is critical.

The only thing I will say about #1 is that I have had to integrate plenty of public (govt) data sets into my applications, and let's just say govt agencies are not always that hot at giving you the data types and lengths (or worse, they make you accept XML... yuck).

2. Limitation on index creation. If you try to create an index on 2 columns that actually store 2 characters each but are defined as varchar2(4000), you are out of luck.

Storage of either the data column or an index on these columns does not seem to come into play for this discussion (although I have heard these myths for many years).

Example of index creation limitation:

drop table idx_size_table;

create table idx_size_table (
thincol varchar2(4000),
fatcol  varchar2(4000));

create index idx_size_test on idx_size_table (thincol,fatcol);


gets you the error:

Error starting at line 8 in command:
create index idx_size_test on idx_size_table (thincol,fatcol)
Error at Command Line:8 Column:30
Error report:
SQL Error: ORA-01450: maximum key length (6398) exceeded
01450. 00000 - "maximum key length (%s) exceeded"
*Cause:
*Action:

Too bad if thincol was supposed to store a 2-character country code and fatcol a 2-character state code. No index for you.
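
For contrast, a minimal sketch (hypothetical sizes, reusing the same table name) showing that right-sized columns stay comfortably under the ORA-01450 limit:

drop table idx_size_table;

create table idx_size_table (
thincol varchar2(2),
fatcol  varchar2(2));

-- succeeds: the combined key is only a few bytes, nowhere near the 6398 byte maximum
create index idx_size_test on idx_size_table (thincol,fatcol);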


Example of why storage of indexed columns does not matter:

drop table idx_size_table;

create table idx_size_table (
thincol varchar2(3),
fatcol  varchar2(255));


begin
for i in 1 .. 5000
loop
insert into idx_size_table (thincol, fatcol) values ('abc','abc');
end loop;
end;
/

create index idx_size_thin on idx_size_table (thincol) compute statistics;
create index idx_size_fat on idx_size_table (fatcol) compute statistics;

select index_name, table_name, blevel, leaf_blocks, distinct_keys,
       avg_leaf_blocks_per_key, avg_data_blocks_per_key
from user_indexes;


shows that both indexes have exactly the same characteristics and leaf blocks - the same storage, even though one column was defined as varchar2(3) and the other as varchar2(255), and both only ever held 3 characters.


If you change the insert loop to insert .... values ('abc','abcdefghijklmnop') .... and then drop and recreate the indexes, you will see a difference in the number of blocks - due not to the defined length, but to the actual length.
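
A minimal sketch of that variation, reusing the same table:

truncate table idx_size_table;

begin
for i in 1 .. 5000
loop
insert into idx_size_table (thincol, fatcol) values ('abc','abcdefghijklmnop');
end loop;
end;
/

drop index idx_size_thin;
drop index idx_size_fat;
create index idx_size_thin on idx_size_table (thincol) compute statistics;
create index idx_size_fat on idx_size_table (fatcol) compute statistics;

-- idx_size_fat now needs more leaf blocks than idx_size_thin - the stored values are wider
select index_name, leaf_blocks from user_indexes
where index_name in ('IDX_SIZE_THIN', 'IDX_SIZE_FAT');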

Apologies if there is an original post on this that I could not find, and if I got the code tagging wrong.

and Tom said...

... , if you are not 100% sure of what may get inserted into that column over time. ...

If that is true, please tell me how you can write your application at all. And please tell me how you know it won't exceed 4000 bytes???


Data integrity is the only one I really need - that does it for me.
The index could be an issue.

Client side resources could definitely be another. Imagine you are writing an application that has to handle an arbitrary cursor. Imagine you have a table with 10 columns - the average width of which would be 40 bytes if you "right sized" everything - some are 100 bytes, some are 2 bytes - but the average is 40 bytes (so each row is 400 bytes). Further, suppose you are concerned about performance, so you'll array fetch 100 rows at a time.

Let's do the math

40 bytes/column * 10 columns = 400 bytes/row
400 bytes/row * 100 rows = 40,000 bytes per fetch buffer
Assume you have 25 cursors open
25 * 40,000 bytes = 1,000,000 bytes - about 1mb per connection
Assume you have a connection pool with 100 connections
100 connections * 1mb = about 100mb

Now, do the math with 4000 bytes maximum per column...


4000 bytes/column * 10 columns = 40,000 bytes/row
40,000 bytes/row * 100 rows = 4,000,000 bytes per fetch buffer
Assume you have 25 cursors open
25 * 4mb = 100mb per connection
Assume you have a connection pool with 100 connections
100 connections * 100mb = about 10,000mb


ouch.
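
The same arithmetic as a quick sanity check in SQL (decimal MB, matching the numbers above):

-- right sized:  40 bytes * 10 columns * 100 rows * 25 cursors * 100 connections
-- maxed out:  4000 bytes * 10 columns * 100 rows * 25 cursors * 100 connections
select (40   * 10 * 100 * 25 * 100) / 1000000 as right_sized_mb,
       (4000 * 10 * 100 * 25 * 100) / 1000000 as maxed_out_mb
from dual;

-- RIGHT_SIZED_MB MAXED_OUT_MB
-- -------------- ------------
--            100        10000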


see also

https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1145132537055


Comments

Perfect

Sean, February 16, 2011 - 2:07 pm UTC

Wow, that was quick - thanks, that is exactly what I was looking for (including the original post). Sorry, I guess I need to get better at searching your site; I figured it was already asked and answered a thousand times, and in fact I probably read that post before.

I clearly agree that data types should be defined properly and planned for up front. I just like to make sure I understand as many of the whys as possible (your reply and the original post were perfect).

The "I am not sure" bit is related to the ETL failing due to an unannounced change in the data length by someone out of my control, but I do 100% agree that yes, in fact, I want it to fail so that I can do something intelligent about it.

Thanks again.


Been there...

djb, February 16, 2011 - 2:10 pm UTC

As someone who can say "I've been there, done that", I'll
say, "Don't do it".

I've dealt with government specs before. If you don't have
definite specs, usually with a little effort you can come
up with reasonable expectations. Code to that, with an eye
to the fact that the specs could change, and testing will
reveal your shortcomings.

What I thought would be a 200 character field turned out to
be around 10 chars max. Still, from 4000 to 200 is a
significant improvement.

I have yet to code *anything* with VARCHAR2(4000)

thanks

Sean Power, February 16, 2011 - 9:06 pm UTC

Nor have I, but I have seen tons of ugly stuff and poorly defined columns. I am not aware of Oracle recording the actual max length in the data dictionary, but you can always compare the average length to the declared length in user_tab_cols to get an idea of where folks have padded the column lengths (average length of 3, but declared length of 255???).

select table_name, column_name, data_type, data_length, avg_col_len from user_tab_cols;


Or you can run something like this, which calculates the max length actually stored in each of your character columns and compares it to the declared length. Fair warning: this takes a bit of time once the data gets into the hundreds of GBs.

drop table table_col_max;

create table table_col_max (
table_name  varchar2(255),
column_name varchar2(255),
column_id number,
data_type varchar2(255),
data_length number,
max_length  number);

declare
   v_ln number;
begin
   for r in (
            select table_name, column_name, column_id, data_type, data_length
            from user_tab_columns
            -- character columns only: length() on a LONG column would raise an error,
            -- and data_length is only meaningful as a declared width for these types
            where data_type in ('VARCHAR2', 'CHAR', 'NVARCHAR2', 'NCHAR')
            order by table_name, column_id
            )
   loop
      -- widest value actually stored in this column (identifiers quoted exactly as stored in the dictionary)
      execute immediate 'select max(length("'||r.column_name||'")) from "'||r.table_name||'"' into v_ln;
      insert into table_col_max (table_name, column_name, column_id, data_type, data_length, max_length)
      values (r.table_name, r.column_name, r.column_id, r.data_type, r.data_length, v_ln);
   end loop;
end;
/

select * from table_col_max;


Tom Kyte
February 17, 2011 - 8:12 am UTC

I wasn't talking about Oracle - we just send packets back and forth.

I'm talking about your code, the libraries you might be using. Imagine, if you will, that you were building "the coolest generic API to databases to end all APIs". What might be your approach? A typical one would be:

a) describe query
b) find max column widths
c) compute row size
d) multiply by array fetch size
e) allocate buffer to fetch into

Now, go production (I've seen this API to end all APIs at least a thousand times in the last 24 years; I wrote my own once).


Running that select statement wouldn't do anything for you in the code - especially since, by the time the query executed and came back, the answer would already have changed in the database. Think "update", "insert" and "delete"; the data is constantly changing.

Not only that, but it wouldn't really work either - consider: select to_char(dt,'dd-mon-yyyy hh:mi:ss am'), to_number( num, '999,999,999,999,999.999999999' ), a || 'xyz' || b from t; You have to describe a query, you cannot just look at the columns in a table.

Not working part two: select * from a,b,c,d where join_conditions. Knowing the "maximums" for single tables doesn't work - you need it at the query level.
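
A minimal sketch of what that describe step looks like from the client side, using DBMS_SQL (assuming the varchar2(4000) version of idx_size_table from the question); the describe works at the query level, and col_max_len can only report the declared width - which is all a generic, buffer-allocating client has to go on:

declare
   c       integer := dbms_sql.open_cursor;
   col_cnt integer;
   cols    dbms_sql.desc_tab;
begin
   dbms_sql.parse(c, 'select thincol, fatcol from idx_size_table', dbms_sql.native);
   dbms_sql.describe_columns(c, col_cnt, cols);
   for i in 1 .. col_cnt
   loop
      -- col_max_len is the declared width (4000 here), not the widest value actually stored
      dbms_output.put_line(cols(i).col_name || ': ' || cols(i).col_max_len || ' bytes');
   end loop;
   dbms_sql.close_cursor(c);
end;
/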

Thanks

A reader, February 17, 2011 - 8:41 am UTC

Wrote your own API :-)

Sorry for being unclear (or clear and dense). I was trying to say that if you stumble across a poorly defined schema (e.g. one with hundreds of columns, 95% of which are suspiciously defined as varchar2(255)), then you could use those queries to identify the leading suspects and go back to the database owners and say "Umm, hey, what's the story here?".

I wasn't suggesting using the queries in any way to make the API or code work any better; thanks for clarifying for me.

Tom Kyte
February 17, 2011 - 11:36 am UTC

ah, got it! thanks