Hi Tom,
I'm a big fan of yours and this site and very happy to be able to ask my first question.
1) This question is about performance when dealing with CLOBs. My stored procedure reads records from a table, breaks each record into tokens by a known delimiter, and inserts the resulting tokens into another table. The records themselves can be > 4000 but definitely less than 8000 chars, so I store them as CLOB in a table. The tokens are definitely < 4000, so they fit into a SQL table type. I decided to try a SQL approach, rather than processing with a cursor -- for this purpose I adapted your in_list function from
http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:210612357425
and changed the input parameter to type CLOB. However, I found that this change by itself dramatically decreases performance. If I know in advance that the records will be < 4000 chars, I can use a function that is practically identical, except declared to take a VARCHAR2, and it performs much better. An alternative here would be to process records one row at a time, in which case I can simply do l_data := in_list(mystring); and then insert values from l_data. In this case, because I'm calling in_list from a PL/SQL context, it will not complain if the string is > 4000 chars. What would you advise?
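To make the row-at-a-time alternative concrete, this is roughly what I have in mind -- just a sketch, using list_to_two_col_table, the VARCHAR2 twin of the function shown below (a PL/SQL variable can hold up to 32k chars, so the 4000-char SQL limit doesn't bite here):

declare
    l_data two_col_table_type;
begin
    for rec in ( select row_id, record rec_str from records_table )
    loop
        -- the implicit clob-to-varchar2 conversion is ok in pl/sql for < 32k
        l_data := list_to_two_col_table( rec.rec_str, ';' );
        for i in 1 .. l_data.count
        loop
            insert into fields_table( row_id, seq_no, field_id, record )
            values ( rec.row_id, l_data(i).column_int_value, 0,
                     l_data(i).column_str_value );
        end loop;
    end loop;
end;
/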
Here's the CLOB version of the function:
CREATE OR REPLACE FUNCTION list_to_two_col_table_large(
in_string IN CLOB,
in_separator IN VARCHAR2 := ','
)
RETURN two_col_table_type
IS
l_string CLOB;
l_data two_col_table_type;
l_val VARCHAR2(4000);
l_pos NUMBER;
l_start_pos NUMBER;
BEGIN
l_data := two_col_table_type();
l_string := in_string;
IF in_string IS NULL OR LENGTH(in_string) = 0 THEN
RETURN l_data;
END IF;
l_start_pos := 1;
LOOP
EXIT WHEN l_start_pos > LENGTH(l_string);
l_pos := INSTR(l_string, in_separator, l_start_pos);
IF l_pos < 1 THEN
l_pos := LENGTH(l_string) + 1;
END IF;
l_val :=
LTRIM(RTRIM(SUBSTR(l_string, l_start_pos, l_pos - l_start_pos)));
IF (l_val IS NOT NULL) THEN
l_data.EXTEND;
l_data(l_data.COUNT) := two_col_rec_type(NULL, l_val);
END IF;
l_start_pos := l_pos + LENGTH(in_separator);
END LOOP;
FOR i IN 1 .. l_data.COUNT
LOOP
l_data(i).column_int_value := l_data.COUNT;
END LOOP;
RETURN l_data;
END list_to_two_col_table_large;
/
--a function list_to_two_col_table is identical, except that all CLOB
--references are replaced with VARCHAR2 and l_string is declared as
--VARCHAR2(32000); I am not including it here to conserve space
create or replace type two_col_rec_type as object
(
column_int_value number,
column_str_value varchar2(4000)
);
/
create or replace type two_col_table_type as table of two_col_rec_type;
/
CREATE TABLE records_table
(
ROW_ID NUMBER(10,0) NOT NULL,
RECORD CLOB,
CONSTRAINT PK_records_table PRIMARY KEY(ROW_ID)
);
CREATE TABLE fields_table
(
ROW_ID NUMBER(10,0) NOT NULL,
SEQ_NO NUMBER(10,0),
FIELD_ID NUMBER(10,0),
RECORD VARCHAR2(4000)
);
truncate table fields_table;
commit;
declare
--l_sep varchar2(1);
l_max_count integer;
begin
l_max_count := 50000;
--l_sep := ';';
INSERT INTO fields_table
SELECT *
FROM (SELECT row_id, t.column_int_value AS seq_no, 0 AS field_id,
t.column_str_value AS RECORD
FROM records_table o,
TABLE(list_to_two_col_table_large(o.RECORD, ';')) t)
WHERE ROWNUM < l_max_count;
end;
/
--done in 107.406 seconds
declare
--l_sep varchar2(1);
l_max_count integer;
begin
l_max_count := 50000;
--l_sep := ';';
INSERT INTO fields_table
SELECT *
FROM (SELECT row_id, t.column_int_value AS seq_no, 0 AS field_id,
t.column_str_value AS RECORD
FROM records_table o,
TABLE(list_to_two_col_table(o.RECORD, ';')) t)
WHERE ROWNUM < l_max_count;
end;
/
--done in 6.359 seconds
Here's a sample of the data (in records_table):
insert into records_table values
(1, '[0!,];[1!,];[3!,];[8!,];[10!,];[11!,];[12!,];[13!,];[17!,];[18!,];[29!,];[31!,];[33!,];[47!,];[49!,];[50!,];[51!,];[52!,];[53!,];[57!,];[58!,];[59!,];[60!,]');
insert into records_table values
(2, '[0!,];[1!,];[3!,];[8!,];[10!,];[11!,];[12!,];[13!,];[14!,];[17!,];[18!,];[31!,];[33!,];[47!,];[49!,];[50!,];[51!,];[52!,];[53!,];[57!,];[58!,];[59!,];[60!,]');
and so on.
records_table in my test had 10000 rows.
When inserting into fields_table I would limit the number of rows to 50000. In the 'real' program I may have about 50000 rows in records_table, and the number of tokens in each record usually varies from 10 to 100, although I don't have any actual limit.
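For anyone who wants to reproduce the test, a generator along these lines can produce an equivalent data set (illustrative only -- my real data comes from elsewhere; all_objects is just a convenient row source and needs at least 10000 rows):

insert into records_table ( row_id, record )
select rownum,
       '[0!,];[1!,];[3!,];[8!,];[10!,];[11!,];[12!,];[13!,];[17!,];[18!,];' ||
       '[29!,];[31!,];[33!,];[47!,];[49!,];[50!,];[51!,];[52!,];[53!,];' ||
       '[57!,];[58!,];[59!,];[60!,]'
  from all_objects
 where rownum <= 10000;
commit;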
In this particular case all the records are < 300 chars, so
I may be wasting your time, because the obvious solution would be to redesign the table and avoid CLOB altogether.
However, I have other stored procedures where the record is a string that consists of words (separated by spaces) and I have to tokenize that string into words. In that case, I had some records > 4000 chars, although the majority were < 4000 chars.
Also, I observed a very weird thing:
if in list_to_two_col_table_large I replace all occurrences of l_string with in_string (that is, do not copy the input parameter into the local variable), the performance goes to hell -- I can watch LOB_DATA in v$sort_usage rising and rising (I usually terminate the statement when it reaches 200 MB). I specifically tried to avoid modifying the input CLOB in the code, but I guess that doesn't help. When I copy in_string into l_string and then use l_string, the performance is 'normal' and LOB_DATA stays around 2048 KB. Even then, the performance is still orders of magnitude worse than if the function is declared as taking a VARCHAR2.
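For reference, besides v$sort_usage one can also watch the per-session temporary LOB counts from a second session -- a sketch (I have only relied on v$sort_usage myself):

select s.sid, s.username, t.cache_lobs, t.nocache_lobs
  from v$temporary_lobs t, v$session s
 where t.sid = s.sid
   and s.username is not null;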
Also, a minor annoyance is that in the insert statement I had to hard-code the separator passed to the table function. If I tried to pass in a local variable (initialized outside the insert statement), I would get a run-time error (something like 'cannot select from unnested value').
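The usual suggestion for this class of error seems to be wrapping the call in CAST so the collection type is known at parse time -- a sketch I have not verified against my case (l_sep being the local variable from the anonymous block):

INSERT INTO fields_table
SELECT row_id, t.column_int_value, 0, t.column_str_value
  FROM records_table o,
       TABLE( CAST( list_to_two_col_table_large( o.RECORD, l_sep )
                    AS two_col_table_type ) ) t;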
your comments will be much appreciated.
thanks,
ilya
well, we can certainly optimize this plsql -- for sure. In fact, I ran a comparison using a couple of sets of 156-byte and 64k CLOBs with your code and my "enhanced" code, to find:
ops$tkyte@ORA920PC> declare
2 l_max_count integer;
3 begin
4 l_max_count := 50000;
5 INSERT INTO fields_table
6 SELECT *
7 FROM (SELECT row_id, t.column_int_value AS seq_no, 0 AS field_id,
8 t.column_str_value AS RECORD
9 FROM records_table o,
10 TABLE(list_to_two_col_table_large(o.RECORD, ';')) t)
11 WHERE ROWNUM < l_max_count;
12 end;
13 /
PL/SQL procedure successfully completed.
Elapsed: 00:12:57.73
ops$tkyte@ORA920PC>
ops$tkyte@ORA920PC> declare
2 l_max_count integer;
3 begin
4 l_max_count := 50000;
5 INSERT INTO new_fields_table
6 SELECT *
7 FROM (SELECT row_id, t.column_int_value AS seq_no, 0 AS field_id,
8 t.column_str_value AS RECORD
9 FROM records_table o,
10 TABLE(demo_pkg.new_one(o.RECORD, ';')) t)
11 WHERE ROWNUM < l_max_count;
12 end;
13 /
PL/SQL procedure successfully completed.
Elapsed: 00:00:43.31
ops$tkyte@ORA920PC>
ops$tkyte@ORA920PC> select * from (select * from fields_table MINUS select * from new_fields_table ) where rownum <= 5;
no rows selected
Elapsed: 00:00:00.32
ops$tkyte@ORA920PC> select * from (select * from new_fields_table MINUS select * from fields_table ) where rownum <= 5;
no rows selected
Elapsed: 00:00:00.24
so, we can get this to run about 18 times faster just by tuning the logic a bit.
The "devices" i used
a) dbms_profiler
b) my programming background
I used tricks like:

a) "hey, stack variables are expensive to allocate/deallocate, so let's not do that"
b) "instead of calling a function that will return the same value over and over -- let's just call the function once and remember the value for that call"
c) "strings are faster than clobs, so if we can move the clob into a string -- we can make this go faster, much faster"
d) "if we allocate lots of entries in our collection variable at once -- that is faster than doing it one at a time" (dbms_profiler was useful there; much of the time was spent in l_data.extend)
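Point (d) is easy to see in isolation -- a quick sketch (not from the original test; timings will vary on your system):

declare
    l_tab   two_col_table_type := two_col_table_type();
    l_start number;
begin
    l_start := dbms_utility.get_time;
    for i in 1 .. 100000
    loop
        l_tab.extend;                      -- one allocation per element
    end loop;
    dbms_output.put_line( 'one at a time: ' ||
        (dbms_utility.get_time-l_start) || ' hsecs' );

    l_tab := two_col_table_type();
    l_start := dbms_utility.get_time;
    l_tab.extend( 100000 );                -- one allocation for everything
    dbms_output.put_line( 'in one chunk : ' ||
        (dbms_utility.get_time-l_start) || ' hsecs' );
end;
/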
My modified routine became a package (so we could avoid stack variables):
create or replace package demo_pkg
as
function new_one( in_string in clob, in_separator in varchar2 default ',' ) return two_col_table_type;
end;
/
create or replace package body demo_pkg
as
l_string clob;
l_pos number;
l_start_pos number;
l_length number;
l_data two_col_table_type := two_col_table_type();
l_sep_length number;
l_cnt number;
l_real_str long;
/* I moved my stack variables to the body -- so they would get allocated once, not once per call...*/
FUNCTION new_one( in_string IN CLOB,
in_separator IN VARCHAR2 := ','
) RETURN two_col_table_type
IS
BEGIN
l_string := in_string;
l_length := dbms_lob.getLength(l_string);
/* calling dbms_lob.getlength over and over was killing you */
IF in_string IS NULL OR l_length = 0
THEN
return two_col_table_type();
END IF;
l_start_pos := 1;
l_sep_length := length(in_separator); -- every little bit helps
l_cnt := 0;
if ( l_length < 32760 - l_sep_length ) -- optimize for clobs < 32k, the vast majority (leave room for the appended separator)
then
l_real_str := l_string || in_separator;
loop
l_pos := instr( l_real_str, in_separator );
exit when (nvl(l_pos,0) = 0);
if ( l_data.count = l_cnt ) -- allocate chunks at a time
then
l_data.extend( greatest( trunc(l_length/5), 1 ) ); -- guard: always allocate at least one slot
end if;
l_cnt := l_cnt+1;
l_data(l_cnt) := two_col_rec_type(NULL,
TRIM(substr(l_real_str,1,l_pos-1)) );
l_real_str := substr( l_real_str, l_pos+l_sep_length );
-- removed as many temporaries (l_val) as I could...
if ( l_data(l_cnt).column_str_value is null )
then
l_cnt := l_cnt-1;
end if;
end loop;
else
while ( l_start_pos <= l_length )
LOOP
l_pos := dbms_lob.instr(l_string, in_separator, l_start_pos);
IF l_pos < 1 THEN
l_pos := l_length + 1;
END IF;
if ( l_data.count = l_cnt )
then
l_data.extend( greatest( trunc(l_length/5), 1 ) ); -- guard: always allocate at least one slot
end if;
l_cnt := l_cnt+1;
l_data(l_cnt) := two_col_rec_type(NULL,
TRIM(dbms_lob.substr(l_string, l_pos - l_start_pos,
l_start_pos )) );
if ( l_data(l_cnt).column_str_value is null )
then
l_cnt := l_cnt-1;
end if;
l_start_pos := l_pos + l_sep_length;
END LOOP;
end if;
FOR i IN 1 .. l_cnt
LOOP
l_data(i).column_int_value := l_cnt;
END LOOP;
if ( l_data.count <> l_cnt )
then
l_data.trim( l_data.count-l_cnt );
end if;
RETURN l_data;
END;
end;
/
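and a quick sanity check of the package from SQL (just a sketch with a tiny literal -- your real calls come from the table, as above):

select t.column_int_value, t.column_str_value
  from table( demo_pkg.new_one( to_clob( '[0!,];[1!,];[3!,]' ), ';' ) ) t;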
and that was it, hope that helps some.