Hi Tom, I'm building a PL/SQL web application to collect data from around the world (character set is UTF8). The users will be uploading delimited files, and the application will parse the files into rows and columns. The data is uploaded into a BLOB column, and I split it into rows using a variation on this:
/*******************************************************************************************************************************/
procedure parse_rows (p_pos_id in number) is
   v_file_id    number;   -- note: never populated in this snippet; presumably looked up from p_pos_id
   l_str1       varchar2(32000);
   l_str2       varchar2(32000);
   l_leftover   varchar2(32000);
   l_chunksize  number := 3000;
   l_offset     number := 1;
   --l_linebreak varchar2(2) := chr(13) || chr(10);
   l_linebreak  varchar2(4) := chr(00) || chr(13) || chr(00) || chr(10);
   l_length     number;
   p_blob       blob;
   t_count      number := 0;
begin
   select blob_content
     into p_blob
     from appl_attachments
    where attachment_id = v_file_id;

   l_length := dbms_lob.getlength(p_blob);
   while l_offset < l_length loop
      l_str1 := l_leftover || utl_raw.cast_to_varchar2(dbms_lob.substr(p_blob, l_chunksize, l_offset));
      l_leftover := null;
      l_str2 := l_str1;
      while l_str2 is not null loop
         if instr(l_str2, l_linebreak) <= 0 then
            l_leftover := l_str2;
            l_str2 := null;
         else
            t_count := t_count + 1;
            detrec.unedited_record := substr(l_str2, 1, instr(l_str2, l_linebreak) - 1);
            pos_api.POS_RPT_DETS(detrec, 'raw_save', g_user_name, g_rc, g_msg);
            -- skip past the full delimiter, not a hard-coded 2 characters
            l_str2 := substr(l_str2, instr(l_str2, l_linebreak) + length(l_linebreak));
         end if;
      end loop;
      l_offset := l_offset + l_chunksize;
   end loop;

   if l_leftover is not null then
      detrec.unedited_record := l_leftover;
      pos_api.POS_RPT_DETS(detrec, 'raw_save', g_user_name, g_rc, g_msg);
   end if;
end parse_rows;
/**********************************************************************************************************************/
I slice the BLOB into rows and store each row of delimited data in a CLOB column (detrec.unedited_record). A separate process will parse the delimited data into the final columns. Things work great with English-only data.
The trouble is that some of the files will contain multi-byte data (Chinese, Japanese, Korean). The upload process is getting the multi-byte data into the BLOB (I can export the BLOB to a file and see the Chinese). It also looks like parsing the rows using a linefeed indicator is working OK. But displaying the CLOB (via htp.print) on the web page seems to lose the multi-byte characters.
Got any suggestions on how to preserve the data? It seems to me something in the utl_raw.cast_to_varchar2 step needs to change, but I can't find the smoking gun.
Thanks in advance.
actually, it is your dbms_lob.substr.
when you dbms_lob.substr a BLOB, it is all about BYTES. You are using l_chunksize as a byte count, so if a character consumes more than one byte, odds are a chunk boundary will eventually split some of those characters into two pieces...
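you can see the effect without any LOBs at all. This sketch assumes an AL32UTF8 database, where a CJK character occupies three bytes; keep only two of them and cast_to_varchar2 hands back garbage, not a character:
/*******************************************************************************************************************************/
declare
   -- one Chinese character = 3 bytes in AL32UTF8
   l_raw raw(10) := utl_raw.cast_to_raw('中');
begin
   dbms_output.put_line('byte length: ' || utl_raw.length(l_raw));
   -- chop it after 2 bytes, the way a byte-based chunk boundary would:
   -- the result is an invalid byte sequence, not a character
   dbms_output.put_line(utl_raw.cast_to_varchar2(utl_raw.substr(l_raw, 1, 2)));
end;
/*******************************************************************************************************************************/
that partial byte sequence is exactly what ends up in your CLOB whenever l_chunksize lands inside a multi-byte character.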
suggest you convert the BLOB to a CLOB first and then process the CLOB — substr and instr on a CLOB work in characters, so nothing gets split.
http://docs.oracle.com/cd/B19306_01/appdev.102/b14258/d_lob.htm#i1020356
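a minimal sketch of the conversion using dbms_lob.converttoclob from that link — it assumes your uploaded files really are UTF-8 (the blob_csid would change for any other file encoding):
/*******************************************************************************************************************************/
declare
   l_clob         clob;
   l_dest_offset  integer := 1;
   l_src_offset   integer := 1;
   l_lang_context integer := dbms_lob.default_lang_ctx;
   l_warning      integer;
begin
   dbms_lob.createtemporary(l_clob, true);
   dbms_lob.converttoclob(
      dest_lob     => l_clob,
      src_blob     => p_blob,                       -- your BLOB from appl_attachments
      amount       => dbms_lob.lobmaxsize,
      dest_offset  => l_dest_offset,
      src_offset   => l_src_offset,
      blob_csid    => nls_charset_id('AL32UTF8'),   -- assumption: files are UTF-8
      lang_context => l_lang_context,
      warning      => l_warning);
   -- from here on, dbms_lob.substr/instr on l_clob count CHARACTERS, not bytes,
   -- so your chunk-and-split loop can run against l_clob unchanged in spirit
end;
/*******************************************************************************************************************************/
check l_warning afterwards — dbms_lob.warn_inconvertible_char tells you the source contained bytes that could not be converted.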