Uploading and Parsing Multi-byte BLOB to CLOB
Question and Answer

Tom Kyte

Thanks for the question, Mark.

Asked: April 02, 2008 - 5:32 pm UTC

Last updated: April 03, 2008 - 9:17 pm UTC

Version: 10.1.4.2

Viewed 10K+ times

You Asked

Hi Tom, I'm building a PL/SQL web application to collect data from around the world (the character set is UTF8). The users will be uploading delimited files, and the application will parse the files into rows and columns. The data is uploaded into a BLOB column, and I process it using a variation on this:
/*******************************************************************************************************************************/
procedure parse_rows (p_pos_id in number) is
v_file_id number; 
l_str1      varchar2(32000);
l_str2      varchar2(32000);
l_leftover  varchar2(32000);
l_chunksize number := 3000;
l_offset    number := 1;
--l_linebreak varchar2(2) := chr(13)||chr(10);
l_linebreak varchar2(4) := chr(00) || chr(13)|| chr(00) || chr(10);
l_length    number;
p_blob blob;
t_count  number :=0;
begin

    select blob_content into p_blob from appl_attachments where attachment_id=v_file_id;
    l_length := dbms_lob.getlength(p_blob);
    while l_offset < l_length loop
       l_str1 := l_leftover ||utl_raw.cast_to_varchar2(dbms_lob.substr(p_blob, l_chunksize, l_offset));
       l_leftover := null;
       l_str2 := l_str1;
       while l_str2 is not null loop
        if instr(l_str2, l_linebreak) <= 0 then
           l_leftover := l_str2;
           l_str2 := null;
        else
           t_count := t_count + 1;
           detrec.unedited_record := substr(l_str2, 1, instr(l_str2, l_linebreak)-1);
           pos_api.POS_RPT_DETS(detrec,'raw_save',g_user_name,g_rc,g_msg);
           l_str2 := substr(l_str2, instr(l_str2, l_linebreak)+2);
        end if;
       end loop;
       l_offset := l_offset + l_chunksize;
    end loop;
    if l_leftover is not null then
      detrec.unedited_record := l_leftover;
      pos_api.POS_RPT_DETS(detrec,'raw_save',g_user_name,g_rc,g_msg);
    end if;                                                                      

end parse_rows;
/**********************************************************************************************************************/


This slices the BLOB into rows and stores each row's delimited data in a CLOB column (detrec.unedited_record). A separate process will parse the delimited data into the final columns. Things work great with English-only data.

The trouble is that some of the files will contain multi-byte data (Chinese, Japanese, Korean). The upload process is getting the multi-byte data into the BLOB (I can export the BLOB to a file and see the Chinese). It also looks like parsing the rows using a linefeed indicator is working OK. But displaying the CLOB (via htp.print) on the web page seems to lose the multi-byte characters.

Got any suggestions on how to preserve the data? Seems to me something in the "utl_raw.cast_to_varchar2" process needs to change, but I can't seem to find the smoking gun.

Thanks in advance.

and Tom said...

actually, it is your dbms_lob.substr.

when you dbms_lob.substr a blob - it is all about BYTES. you are using l_chunksize. If you have a character that consumes more than one byte - odds are you will split some of them into two pieces...

suggest you convert to clob and then process the CLOB

http://docs.oracle.com/cd/B19306_01/appdev.102/b14258/d_lob.htm#i1020356
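A minimal sketch of that suggestion (the table name is taken from the question; the bind variable and the choice of DEFAULT_CSID are assumptions, not from the original application): convert the BLOB to a temporary CLOB with DBMS_LOB.CONVERTTOCLOB first, so that subsequent substrings count characters rather than bytes.

```sql
DECLARE
    l_blob         BLOB;
    l_clob         CLOB;
    l_dest_offset  INTEGER := 1;
    l_src_offset   INTEGER := 1;
    l_lang_context INTEGER := DBMS_LOB.DEFAULT_LANG_CTX;
    l_warning      INTEGER;
BEGIN
    SELECT blob_content INTO l_blob
      FROM appl_attachments
     WHERE attachment_id = :file_id;

    DBMS_LOB.CREATETEMPORARY(l_clob, TRUE);

    -- DEFAULT_CSID assumes the BLOB's bytes are already in the database
    -- character set (UTF8 here); pass the file's real CSID if it differs.
    DBMS_LOB.CONVERTTOCLOB(
        dest_lob     => l_clob,
        src_blob     => l_blob,
        amount       => DBMS_LOB.LOBMAXSIZE,
        dest_offset  => l_dest_offset,
        src_offset   => l_src_offset,
        blob_csid    => DBMS_LOB.DEFAULT_CSID,
        lang_context => l_lang_context,
        warning      => l_warning);

    -- On a CLOB, dbms_lob.substr counts CHARACTERS, so a multi-byte
    -- character can no longer be split across two chunks.
    DBMS_OUTPUT.PUT_LINE(DBMS_LOB.SUBSTR(l_clob, 100, 1));

    DBMS_LOB.FREETEMPORARY(l_clob);
END;
/
```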


Comments

smoking too ...

nameless, April 03, 2008 - 10:31 am UTC

It is also the cast_to_varchar2 since <doco>Multibyte character boundaries are ignored</doco>.

Tom Kyte
April 03, 2008 - 9:10 pm UTC

... dbms_lob.substr(p_blob, l_chunksize, l_offset) ...

that is the cause - what the subsequent calls do no longer matters - that return value has already truncated the bytes.

Character Sets of Text in BLOBS

Andy Noble, April 03, 2008 - 12:20 pm UTC

I have a similar situation. I have a UTF-8 DB and my users upload CSV files that get placed into a BLOB.

If they are using clients with different character sets, then no conversion takes place between the client-side character set and the DB's UTF-8, because we're uploading to a BLOB. When I attempt to convert the BLOB to a CLOB, no conversion takes place, and the CLOB treats the raw bytes from the BLOB as already being UTF-8. If I upload a UTF-8 CSV file into the BLOB, there is no problem.

Is this understanding correct? Is there any way to "detect" which character set the BLOB holds so that it can be converted into UTF-8 on its way to a CLOB? Would it be better to upload directly into a CLOB, so that the client character set is known and can be used to translate to UTF-8 at upload time?

Thanks very much
Tom Kyte
April 03, 2008 - 9:17 pm UTC

look in the cgi-environment and you'll get a glimpse into the encoding of the client.
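A sketch of that check inside a mod_plsql/OWA page (the specific header names are only hints about the client's encoding - browsers are not required to send them):

```sql
DECLARE
    l_charset VARCHAR2(200);
BEGIN
    -- What the browser says it can accept
    l_charset := OWA_UTIL.GET_CGI_ENV('HTTP_ACCEPT_CHARSET');
    HTP.PRINT('Accept-Charset: ' || NVL(l_charset, '(not sent)'));

    -- On a POST, the Content-Type header may carry "; charset=..."
    HTP.PRINT('Content-Type: ' ||
              NVL(OWA_UTIL.GET_CGI_ENV('CONTENT_TYPE'), '(not sent)'));
END;
```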

Working Version

Mark Johnson, April 10, 2008 - 7:03 pm UTC

Tom, thanks for the response. I was able to take your suggestions, combine them with research and reading the documentation (can you imagine that?), and get the following working.
procedure parse_rows (p_pos_id in number) is
v_file_id number;
l_str1      varchar2(32000);
l_str2      varchar2(32000);
l_leftover  varchar2(32000);
l_chunksize number := 1000;
l_offset    number := 1;
l_linebreak varchar2(2) := chr(13)||chr(10);
l_length    number;
p_blob blob;
t_count  number :=0;
v_clob clob;
v_dest_offset  INTEGER := 1;
v_src_offset    INTEGER := 1;
v_blob_csid number := 0;
v_lang_context number := dbms_lob.default_lang_ctx;
begin

    if rptrec.file_charset = 'ASCII' then 
        v_blob_csid := 1;
    else
        v_blob_csid := 2002; 
    end if; 
    select blob_content into p_blob from appl_attachments where attachment_id=v_file_id;
    DBMS_LOB.CREATETEMPORARY(v_clob,TRUE);
    dbms_lob.CONVERTTOCLOB(v_clob,p_blob,DBMS_LOB.LOBMAXSIZE,v_dest_offset,v_src_offset,v_blob_csid,v_lang_context,g_msg);
    l_length := dbms_lob.getlength(v_clob);
    while l_offset < l_length loop
       l_str1 := l_leftover ||dbms_lob.substr(v_clob, l_chunksize, l_offset);
       l_leftover := null;
       l_str2 := l_str1;
       while l_str2 is not null loop
        if instr(l_str2, l_linebreak) <= 0 then
           l_leftover := l_str2;
           l_str2 := null;
        else
           t_count := t_count + 1;
           detrec.unedited_record := substr(l_str2, 1, instr(l_str2, l_linebreak)-1);
           pos_api.POS_RPT_DETS(detrec,'raw_save',g_user_name,g_rc,g_msg);
           l_str2 := substr(l_str2, instr(l_str2, l_linebreak)+2);
        end if;
       end loop;
       l_offset := l_offset + l_chunksize;
    end loop;
    if l_leftover is not null then
      detrec.unedited_record := l_leftover;
      pos_api.POS_RPT_DETS(detrec,'raw_save',g_user_name,g_rc,g_msg);
    end if;

end parse_rows;


The trick was determining the character set of the file uploaded by the user. So I'm asking the user to provide it and setting the blob_csid appropriately. I will likely code that to accommodate a list of values and use the NLS_CHARSET_ID function to get the ID.
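For reference, a sketch of that lookup (the character-set names below are just examples, not the application's actual list of values):

```sql
-- NLS_CHARSET_ID maps an Oracle character-set name to the numeric
-- CSID that DBMS_LOB.CONVERTTOCLOB expects; it returns NULL for
-- names it does not recognize.
SELECT NLS_CHARSET_ID('US7ASCII') AS ascii_csid,
       NLS_CHARSET_ID('AL32UTF8') AS utf8_csid,
       NLS_CHARSET_ID('ZHS16GBK') AS gbk_csid
  FROM dual;
```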

thanks for all you do here for the community!
