Skip to Main Content

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, Ram.

Asked: February 21, 2007 - 1:44 pm UTC

Answered by: Tom Kyte - Last updated: October 07, 2009 - 8:48 am UTC

Category: Database - Version: 10.1.0

Viewed 10K+ times! This question is

You Asked

Hi tom,
I am further confused with the character sets and how data is stored.
Could not find find my previous question to post a follow up . So i am posting a
new question.
The following are my DB parameters...
1 NLS_NCHAR_CHARACTERSET AL16UTF16
2 NLS_LANGUAGE AMERICAN
3 NLS_TERRITORY AMERICA
4 NLS_CURRENCY $
5 NLS_ISO_CURRENCY AMERICA
6 NLS_NUMERIC_CHARACTERS .,
7 NLS_CHARACTERSET AL32UTF8
8 NLS_CALENDAR GREGORIAN
9 NLS_DATE_FORMAT DD-MON-RR
10 NLS_DATE_LANGUAGE AMERICAN
11 NLS_SORT BINARY
12 NLS_TIME_FORMAT HH.MI.SSXFF AM
13 NLS_TIMESTAMP_FORMAT DD-MON-RR HH.MI.SSXFF AM
14 NLS_TIME_TZ_FORMAT HH.MI.SSXFF AM TZR
15 NLS_TIMESTAMP_TZ_FORMAT DD-MON-RR HH.MI.SSXFF AM TZR
16 NLS_DUAL_CURRENCY $
17 NLS_COMP BINARY
18 NLS_LENGTH_SEMANTICS BYTE
19 NLS_NCHAR_CONV_EXCP FALSE
20 NLS_RDBMS_VERSION 10.1.0.5.0

My confusion is since my characterset is a multibyte characterset shouldnt a column definition like .....
create table testsb(col1 varchar2(60 char));
take up 4 bytes per character inserted?
so for example if i insert .....
insert into testsb values('TESTING');
Should that not return 7*4 when i run the following query....
select lengthb(col1) from testsb;

But i get 7 when i run this query. So what am i missing here?
Does it use all 4 bytes only for characters that need 4 bytes to store?
what if i had defined the table like...
create table testsb(col1 varchar2(60));
wouldnt that limit to just 60 bytes?

and in that case wont i get a error when i try to insert multibyte characters that might take longer that 60bytes??

CONFUSED .. totally confused. I am not even sure if my questions make sense.

anyway i leave it to the ORACLE to decide... thanks

and we said...

multibyte character sets do not use a fixed number of bytes to represent each and every character.

They use between 0 and N bytes - the word testing - likely just uses 7 one byte characters.

You would need to use "fancier" characters from your character set.

Here is an excerpt from my last book on this topic:

<quote src=Expert Oracle Database Architecture>

Bytes or Characters

The VARCHAR2 and CHAR types support two methods of specifying lengths:
* In bytes: VARCHAR2(10 byte). This will support up to 10 bytes of data, which could be as few as two characters in a multibyte character set.
* In characters: VARCHAR2(10 char). This will support to up 10 characters of data, which could be as much as 40 bytes of information.

When using a multibyte character set such as UTF8, you would be well advised to use the CHAR modifier in the VARCHAR2/CHAR definition¿that is, use VARCHAR2(80 CHAR), not VARCHAR2(80), since your intention is likely to define a column that can in fact store 80 characters of data. You may also use the session or system parameter NLS_LENGTH_SEMANTICS to change the default behavior from BYTE to CHAR. I do not recommend changing this setting at the system level; rather, use it as part of an ALTER SESSION setting in your database schema installation scripts. Any application that requires a database to have a specific set of NLS settings makes for an ¿unfriendly¿ application. Such applications generally cannot be installed into a database with other applications that do not desire these settings, but rely on the defaults to be in place.

One other important thing to remember is that the upper bound of the number of bytes stored in a VARCHAR2 is 4,000. However, even if you specify VARCHAR2(4000 CHAR), you may not be able to fit 4,000 characters into that field. In fact, you may be able to fit as few as 1,000 characters in that field if all of the characters take 4 bytes to be represented in your chosen character set!

The following small example demonstrates the differences between BYTE and CHAR and how the upper bounds come into play. We¿ll create a table with three columns, the first two of which will be 1 byte and one character, respectively, with the last column being 4,000 characters. Notice that we¿re performing this test on a multibyte character set database using the character set AL32UTF8, which supports the latest version of the Unicode standard and encodes characters in a varyiable length fashion using from 1 to 4 bytes for each character:

ops$tkyte@O10GUTF> select *
  2    from nls_database_parameters
  3   where parameter = 'NLS_CHARACTERSET';
 
PARAMETER                      VALUE
------------------------------ --------------------
NLS_CHARACTERSET               AL32UTF8

ops$tkyte@O10GUTF> create table t
2  ( a varchar2(1),
  3    b varchar2(1 char),
  4    c varchar2(4000 char)
  5  )
  6  /
Table created.


Now, if we try to insert into our table a single character that is 2 bytes long in UTF, we observe the following:
ops$tkyte@O10GUTF> insert into t (a) values (unistr('\00d6'));
insert into t (a) values (unistr('\00d6'))
                          *
ERROR at line 1:
ORA-12899: value too large for column "OPS$TKYTE"."T"."A" 
(actual: 2, maximum: 1)


This example demonstrates two things:
* VARCHAR2(1) is in bytes, not characters. We have single Unicode character, but it won¿t fit into a single byte.
* As you migrate an application from a single-byte fixed-width character set to a multibyte character set, you might find that the text that used to fit into your fields no longer does.

The reason for the second point is that a 20-character string in a single-byte character set is 20 bytes long and will absolutely fit in a VARCHAR2(20). However a 20-character field could be as long as 80 bytes in a multibyte character set, and 20 Unicode characters may well not fit in 20 bytes. You might consider modifying your DDL to be VARCHAR2(20 CHAR) or using the NLS_LENGTH_SEMANTICS session parameter mentioned previously when running your DDL to create your tables.

If we insert that single character into a field set up to hold a single character, we will observe the following:

ops$tkyte@O10GUTF> insert into t (b) values (unistr('\00d6'));
1 row created.
ops$tkyte@O10GUTF> select length(b), lengthb(b), dump(b) dump from t;
LENGTH(B) LENGTHB(B) DUMP
---------- ---------- --------------------
         1          2 Typ=1 Len=2: 195,150


That INSERT succeeded, and we can see that the LENGTH of the inserted data is one character¿all of the character string functions work ¿character-wise.¿ So the length of the field is one character, but the LENGTHB (length in bytes) function shows it takes 2 bytes of storage, and the DUMP function shows us exactly what those bytes are. So, that example demonstrates one very common issue people encounter when using multibyte character sets, namely that a VARCHAR2(N) doesn¿t necessarily hold N characters, but rather N bytes.
The next issue people confront frequently is that the maximum length in bytes of a VARCHAR2 is 4,000, and in a CHAR it is 2,000:

ops$tkyte@O10GUTF> declare
  2          l_data varchar2(4000 char);
  3          l_ch   varchar2(1 char) := unistr( '\00d6' );
  4  begin
  5          l_data := rpad( l_ch, 4000, l_ch );
  6          insert into t ( c ) © values ( l_data );
  7  end;
  8  /
declare
*
ERROR at line 1:
ORA-01461: can bind a LONG value only for insert into a LONG column
ORA-06512: at line 6


That shows that a 4,000-character string that is really 8,000 bytes long cannot be stored permanently in a VARCHAR2(4000 CHAR) field. It fits in the PL/SQL variable because in PL/SQL a VARCHAR2 is allowed to be up to 32KB in size. However, when it is stored in a table, the hard limit is 4,000 bytes. We can store 2,000 of these characters successfully:

ops$tkyte@O10GUTF> declare
  2          l_data varchar2(4000 char);
  3          l_ch   varchar2(1 char) := unistr( '\00d6' );
  4  begin
  5          l_data := rpad( l_ch, 2000, l_ch );
  6          insert into t ( c ) values ( l_data );
  7  end;
  8  /
 
PL/SQL procedure successfully completed.
ops$tkyte@O10GUTF> select length( c ), lengthb( c )
  2    from t
  3   where c is not null;
 
LENGTH(C)  LENGTHB(C)
---------- ----------
2000       4000


And as you can see, they consume 4,000 bytes of storage.
</quote>

and you rated our response

  (4 ratings)

Is this answer out of date? If it is, please let us know via a Review

Reviews

February 22, 2007 - 3:32 am UTC

Reviewer: A reader


NLS_LANG and Client install

February 27, 2007 - 6:06 pm UTC

Reviewer: MWise from VA, USA

We are in the process of moving off of an Oracle 8.1.6.1.0 database to 10.2.0.1.0. Our old database character set was US7ASCII and our character set is UTF8. All of our clients are using a 9.2 driver, should we upgrade them to the 10g driver? We found in testing that MsAccess apps are treating longer varchar2 fields now as the MS "memo" field, which is causing lots of problems. I've tried changing my NLS_LANG= setting on my local PC, but the 9.2 driver doesn't recognize the AMERICA_AMERICAN.UTF8 setting. I'm thinking that upgrading to the newest driver would fix this, but I don't to have to migrate a lot of people if this isn't the problem.
We are also having to change a lot of code to use the varchar2(xx char) syntax and I think our SAS programmer is having trouble inserting data because some tables were set up using varchar2(xx bytes).
What is the advantage of moving to UTF8 character set if we don't use expanded character sets nor would be using non-ASCII characters?
Tom Kyte

Followup  

February 28, 2007 - 2:55 pm UTC

if you are never going to use the expanded character sets (multibyte), there is no advantage.

How do you know that unistr('\00d6') is 2 bytes long in UTF

February 19, 2009 - 7:06 am UTC

Reviewer: gerardnico from Nederland

How do you know that unistr('\00d6') is 2 bytes long in UTF. I search a while and I dont' find any information.

I try on my computer to see the difference between byte and char and I don't find any of them with the character set WE8MSWIN1252.

I try to find out where I can have this information but it's very hard.

SQL>  SELECT *
  2    FROM nls_database_parameters
  3  WHERE parameter = 'NLS_CHARACTERSET';

PARAMETER                      VALUE
------------------------------ ----------------------------------------
NLS_CHARACTERSET               WE8MSWIN1252

SQL> DECLARE
  2     Vs_Test VARCHAR2(2 BYTE);
  3  BEGIN
  4     Vs_Test := NULL;
  5     FOR i In 1..2
  6     LOOP
  7     Vs_Test := Vs_Test || UNISTR( '\00d6' );
  8     DBMS_OUTPUT.PUT_LINE( i || ' - Length  - ' || length(Vs_Test));
  9     DBMS_OUTPUT.PUT_LINE( i || ' - Lengthb - ' || lengthb(Vs_Test));
 10     END LOOP;
 11  END;
 12  /
1 - Length  - 1
1 - Lengthb - 1
2 - Length  - 2
2 - Lengthb - 2

PL/SQL procedure successfully completed.




Tom Kyte

Followup  

February 19, 2009 - 8:40 am UTC

I forget how I arrived at that one, but the example shows it is. I had a UTF based database and it told me how big it was.

... I try on my computer to see the difference between byte and char and I don't find any of them with the character set WE8MSWIN1252. ...

no kidding, why would there be? That is a SINGLE BYTE characterset, you would - well - need to use a multibyte character set to see a difference.

Good answer, but...

September 30, 2009 - 10:43 am UTC

Reviewer: A reader from Canada

The 10gR2 Globalization Guide:

http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14225.pdf

says:

"The following expressions use byte semantics:
■ VARCHAR2(20 BYTE)
■ SUBSTRB(string, 1, 20)
Note the BYTE qualifier in the VARCHAR2 expression and the B suffix in the SQL
function name.
The following expressions use character semantics:
■ VARCHAR2(10 CHAR)
■ SUBSTR(string, 1, 10)
Note the CHAR qualifier in the VARCHAR2 expression.
The NLS_LENGTH_SEMANTICS initialization parameter determines whether a new
column of character datatype uses byte or character semantics. The default value of the
parameter is BYTE. The BYTE and CHAR qualifiers shown in the VARCHAR2 definitions
should be avoided when possible because they lead to mixed-semantics databases.
Instead, set NLS_LENGTH_SEMANTICS in the initialization parameter file and define
column datatypes to use the default semantics based on the value of NLS_LENGTH_
SEMANTICS."


What is wrong with a mixed-semantics database?

We have fields in a table that will never contain multi-byte characters (because we generate the value of the fields) and someone would like to keep them as:

varchar2(10)

rather than

varchar2(10 char)

as they are used for indexes, and they feel that this would increase the index size. Is this correct?


Tom Kyte

Followup  

October 07, 2009 - 8:48 am UTC

... What is wrong with a mixed-semantics database? ...

it is confusing, it is not the default, it is not what 99.9999% of the people using the database would expect.

consistency is good.

varchar2 strings - the V stands for VARYING, we only store what we are given, their fears with varchar2 have no basis...