api - and can i use substr for japanese ( utf8)
umesh, March 04, 2003 - 3:20 am UTC
I have an application where user enters more than 2000 characters (bytes) in a text field. I get the data and insert into a column in a chunk of 2000 bytes. i.e if user enters 3000 bytes I take 2000 bytes in a row and the remaining in the next row. So far our application supported only english so no problem.
now our application needs to support japanese, korean etc. i.e utf8 if i take a substr of 2000 for the data entered by user it is going to be byte wise. but i want yi character wise split to happen does oracle provide any api's for the same
thanks
March 04, 2003 - 6:59 am UTC
substr is character wise.
substrB would be byte wise.
you should just use a CLOB and do away with chunking (which since utf8 is a 4byte encoding scheme means you would have to use 1000, not 2000)
but -- USE A CLOB
But how do you split characterwise before a particular byte?
Mike Friedman, March 27, 2005 - 4:17 am UTC
We receive AL32UTF8 data from an external system and need to store it in DB columns. For example, a VARCHAR2(30). Our database is also AL32UTF8.
Business rule is to truncate the incoming data if it won't fit in a column.
If we use s30 := substr(s,1,30) then our string may be too large to fit in a varchar2(30).
If we use s30 := substrb(s,1,30) then our string may be truncated in the middle of a character, producing a final junk character.
How can we truncate at the character boundary at or before the 30th byte?
Best idea I can come up with is to do a binary search using substr() and lengthb() to find an appropriate x for substr(s,1,x) but that's very ugly.
Please advise.
March 27, 2005 - 10:16 am UTC
I guess I would be looking for a bit of procedural code (although, why don't you use varchar2(30 char) instead of 30 bytes?)
something like:
create or replace
function mb_shorten( p_string in varchar2, p_length in number )
return varchar2
as
l_length number := p_length;
l_string varchar2(4000) := substr( p_string, 1, p_length );
begin
while ( lengthb(l_string) > p_length )
loop
l_length := l_length-1;
l_string := substr( l_string, 1, l_length );
end loop;
return l_string;
end;
/
perhaps
Yes... suboptimal but adequate
Mike Friedman, March 27, 2005 - 11:08 am UTC
Procedural code is the fallback, but that's a lot slower than doing something clever with built ins.
As for why not to user varchar2(30 char), several reasons:
1. There is general uncertainty how long characters can be. For example, if Oracle starts supporting the extended Unicode characters will you go to 6 bytes? Because of this it is very hard to know how large the maximum varchar2 size is.
2. Similar issues with maximum length of concatenated indexes, etc.
3. Past sad experience is that many applications and tools do not properly handle character semantics - they assume the number of characters is the number of bytes and break in unfriendly ways.
March 27, 2005 - 11:18 am UTC
well, you have access to all of the builtins I do ;) if you can think of something -- go for it. short of using a big case or decode, I don't see one. case would work
case when lengthb(s30) > 30
then case when lengthb(substr(s30,1,29)) > 30
then case when lengthb(substr(s30,1,28)) > 30
then....
else substr(s30,1,28)
end
else substr(s30,1,29)
end
else s30
end
1) so? you want to store 30 characters apparently, seems pointless almost to store the first 5 characters only. the other 25 are not useful?
2) you are losing data, that is better? I cannot imagine having to index for retrieval purposes data that has been purposely corrupted (truncated)
3) cannot do anything about that -- but the database would report them the maximum data length when asked.
Truncation and sizes
Michael Friedman, March 27, 2005 - 7:57 pm UTC
Short answers:
1. This is just one column, but our system has thousands. Many of them have issues with maximum sizes.
2. We've made the decision that the data that can fit in 30 bytes is enough. If that changes we'll change to varchar2 to be larger... in bytes. And since we use %TYPE notation the migration will not be too painful.
3. I agree with you about truncation, but sadly enough this kind of truncation is a fact of life.
a. We're now integrating with Oracle LDAP. There is no control over field sizes. So we set a business rule and truncate.
b. EDI is a global standard. But both ANSI X12 and EDIFACT place restrictions on the number of bytes in key fields (ie. company name). So we, and everyone else who plays in this space, truncate.
March 28, 2005 - 7:15 am UTC
3) facts of life like this are decisions made, not facts of life. A fact of life is that we will die. That is an idea we have to live with. This is not a fact of life.
Sorry -- but I've given you my ideas here. This is a rather unique situation in my experience -- to have thousands of fields that you are basically saying "truncate, we don't care".