Skip to Main Content
  • Questions
  • Convert function not working with CLOB datatype giving garbage value

Breadcrumb

Question and Answer

Connor McDonald

Thanks for the question, Bhanu.

Asked: September 04, 2018 - 1:03 pm UTC

Last updated: September 09, 2018 - 10:32 pm UTC

Version: 12c

Viewed 1000+ times

You Asked

We have some data of type lob and varchar2 which we need to export to another application which is CSV supported i.e. it reads data in form of CSV only. Its legacy application so we need to export data from oracle database in CSV format and need to send it to that application.
Issue is that application only supports ASCII 7 bit character (ASCII code up to 127 only unsigned) for the extended support we convert ASCII code>128 to ASCII code under 127 to minimize the data loss for that we have a mapping table which maps characters from we8mswin1252 (8 bit encoding ) to ASCII 7 bit characters.
Currently, we have a program which does so and written in Ruby but its extremely slow so we want to rewrite it in Oracle.

Hence we need to convert data which is of varchar as well as CLOB type to we8mswin1252 encoding(or any other 8-bit encoding) then use mapping table and to convert data to ASCII 7 bit encoding.
I got convert function which is used for character set conversion and this function is working fine with varchar2 datatype but it's giving some fuzzy characters when used with CLOB. My current encoding is AL32UTF8 for the database.

select value,CONVERT(value,'we8mswin1252'),CONVERT(to_clob(value),'we8mswin1252') from temp;

Board Elections – Shareholder Nominations Board Elections �Shareholder Nominations 䉯慲搠䕬散瑩潮猠阠卨慲敨潬摥爠乯浩湡瑩潮


select * from nls_database_parameters where parameter like '%CHARACTERSET%';

I wrote a custom function in which I break CLOB to chunks and convert to varchar2 to do so but it's taking too much hence want to know why convert is behaving like so if there is any other built function I can try will be very helpful.

Thanks

and we said...

The CONVERT function is a legacy function that should not be used for any application-related character set conversion. You cannot have a WE8MSWIN1252 VARCHAR2 or CLOB string in an AL32UTF8 database in a supported way. Therefore, the result of CONVERT(...,'WE8MSWIN1252'[,'AL32UTF8']) is unsupported by default. CLOB is especially tricky as it uses a special storage encoding in multibyte databases, such as AL32UTF8. CONVERT is useful for some repair tasks only.

The supported way of processing data in character sets other than the database character set is to use the RAW data type and the conversion procedures from the package UTL_I18N. However, UTL_I18N does not support CLOBs. Therefore, you do have to read CLOBs in 32K chunks, convert them to WE8MSWIN1252 with UTL_I18N.STRING_TO_RAW, do your post-processing and write the file with UTL_FILE.PUT_RAW.

Also, you could actually do the translation with TRANSLATE on the AL32UTF8 data (also chunked as TRANSLATE does not accept CLOBs). As the resulting text is US7ASCII, you can write it directly with UTL_FILE.PUT_LINE, as US7ASCII is a proper binary subset of AL32UTF8 and no conversion is necessary. Disclaimer: TRANSLATE is rather slow for large replacement sets (128 characters) in multibyte character sets but maybe it will suffice.

Another option: rewrite your code in C. Read CLOB data with OCI into an AL16UTF16 buffer (AL16UTF16 is the actual CLOB encoding in an AL32UTF8 database so no conversion required) and do the post-processing (accent stripping) in C. I do not really see a need for intermediate conversion to WE8MSWIN1252 in the whole process. When the data is in an AL16UTF16 buffer, go character by character (reading it into an 'int ch' variable in a big-endian way), then do the character conversion. First, check if 'ch' is less than 256, which makes it a WE8ISO8859P1 code, subset of WE8MSWIN1252 (WE8ISO8859P1 codes are numerically equal to the first 256 AL16UTF16 codes). You can use your mapping table to produce the US7ASCII output byte for WE8ISO8859P1 codes. If the value of 'ch' is greater or equal 256, then you can have a few explicit additional mappings (e.g. with 'switch (ch)') for WE8MSWIN1252-specific letters ('š', 'œ', 'ž', 'Ÿ') or punctuation ('‘', '’', '“', '”', '™'), if deemed necessary. For any unmapped character, skip it or replace with space or question mark or any other default replacement character of your choice (Oracle uses '?').

Alternatively, try the same in Java+JDBC.


Rating

  (1 rating)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Bhanu Yadav, September 20, 2018 - 5:58 am UTC

Got it, Thanks Tom.
I got a solution by using unix command too.

More to Explore

SQL

The Oracle documentation contains a complete SQL reference.