Skip to Main Content

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, Reji.

Asked: June 22, 2001 - 11:37 am UTC

Last updated: March 28, 2005 - 7:15 am UTC

Version: 8.1.7

Viewed 10K+ times! This question is

You Asked

Tom:
We have a VARCHAR2(4000) column in a UTF8 database. When we try to
insert a string which is 4000 bytes the oracle server giving error message "the value is too large and giving a size of 4496 (around this number ))". We are using JDBC thin driver (jdbc2 thin driver ).
Is it a jdbc limitation or we have to do something special here. When we try to reduce the size of the input string to 3000 it worked
fine.

Please explain why we are getting this behavior.

Thx
Reji

and Tom said...

varchar2(4000) holds 4000 BYTES.

A string which is 4000 CHARACTERS in UTF8 may be MUCH larger then 4000 BYTES. It could be 16000 BYTES.

This is not a jdbc limitation, it is rather a fact of UTF8 and multi-byte character sets in general. They (by definition) need more space. A varchar2(4000) can hold between a 1000 and 4000 character UTF8 string.

Rating

  (4 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

api - and can i use substr for japanese ( utf8)

umesh, March 04, 2003 - 3:20 am UTC

I have an application where user enters more than 2000 characters (bytes) in a text field. I get the data and insert into a column in a chunk of 2000 bytes. i.e if user enters 3000 bytes I take 2000 bytes in a row and the remaining in the next row. So far our application supported only english so no problem.
now our application needs to support japanese, korean etc. i.e utf8 if i take a substr of 2000 for the data entered by user it is going to be byte wise. but i want yi character wise split to happen does oracle provide any api's for the same
thanks

Tom Kyte
March 04, 2003 - 6:59 am UTC

substr is character wise.

substrB would be byte wise.

you should just use a CLOB and do away with chunking (which since utf8 is a 4byte encoding scheme means you would have to use 1000, not 2000)

but -- USE A CLOB

But how do you split characterwise before a particular byte?

Mike Friedman, March 27, 2005 - 4:17 am UTC

We receive AL32UTF8 data from an external system and need to store it in DB columns. For example, a VARCHAR2(30). Our database is also AL32UTF8.

Business rule is to truncate the incoming data if it won't fit in a column.

If we use s30 := substr(s,1,30) then our string may be too large to fit in a varchar2(30).

If we use s30 := substrb(s,1,30) then our string may be truncated in the middle of a character, producing a final junk character.

How can we truncate at the character boundary at or before the 30th byte?

Best idea I can come up with is to do a binary search using substr() and lengthb() to find an appropriate x for substr(s,1,x) but that's very ugly.

Please advise.


Tom Kyte
March 27, 2005 - 10:16 am UTC

I guess I would be looking for a bit of procedural code (although, why don't you use varchar2(30 char) instead of 30 bytes?)


something like:

create or replace
function mb_shorten( p_string in varchar2, p_length in number )
return varchar2
as
l_length number := p_length;
l_string varchar2(4000) := substr( p_string, 1, p_length );
begin
while ( lengthb(l_string) > p_length )
loop
l_length := l_length-1;
l_string := substr( l_string, 1, l_length );
end loop;
return l_string;
end;
/


perhaps

Yes... suboptimal but adequate

Mike Friedman, March 27, 2005 - 11:08 am UTC

Procedural code is the fallback, but that's a lot slower than doing something clever with built ins.

As for why not to user varchar2(30 char), several reasons:

1. There is general uncertainty how long characters can be. For example, if Oracle starts supporting the extended Unicode characters will you go to 6 bytes? Because of this it is very hard to know how large the maximum varchar2 size is.

2. Similar issues with maximum length of concatenated indexes, etc.

3. Past sad experience is that many applications and tools do not properly handle character semantics - they assume the number of characters is the number of bytes and break in unfriendly ways.

Tom Kyte
March 27, 2005 - 11:18 am UTC

well, you have access to all of the builtins I do ;) if you can think of something -- go for it. short of using a big case or decode, I don't see one. case would work

case when lengthb(s30) > 30
then case when lengthb(substr(s30,1,29)) > 30
then case when lengthb(substr(s30,1,28)) > 30
then....
else substr(s30,1,28)
end
else substr(s30,1,29)
end
else s30
end



1) so? you want to store 30 characters apparently, seems pointless almost to store the first 5 characters only. the other 25 are not useful?

2) you are losing data, that is better? I cannot imagine having to index for retrieval purposes data that has been purposely corrupted (truncated)

3) cannot do anything about that -- but the database would report them the maximum data length when asked.

Truncation and sizes

Michael Friedman, March 27, 2005 - 7:57 pm UTC

Short answers:

1. This is just one column, but our system has thousands. Many of them have issues with maximum sizes.

2. We've made the decision that the data that can fit in 30 bytes is enough. If that changes we'll change to varchar2 to be larger... in bytes. And since we use %TYPE notation the migration will not be too painful.

3. I agree with you about truncation, but sadly enough this kind of truncation is a fact of life.
a. We're now integrating with Oracle LDAP. There is no control over field sizes. So we set a business rule and truncate.
b. EDI is a global standard. But both ANSI X12 and EDIFACT place restrictions on the number of bytes in key fields (ie. company name). So we, and everyone else who plays in this space, truncate.

Tom Kyte
March 28, 2005 - 7:15 am UTC

3) facts of life like this are decisions made, not facts of life. A fact of life is that we will die. That is an idea we have to live with. This is not a fact of life.


Sorry -- but I've given you my ideas here. This is a rather unique situation in my experience -- to have thousands of fields that you are basically saying "truncate, we don't care".