Skip to Main Content

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, Mikito.

Asked: September 26, 2001 - 8:34 pm UTC

Answered by: Tom Kyte - Last updated: March 25, 2020 - 1:51 am UTC

Category: Database - Version: 9i

Viewed 1000+ times

You Asked

I've looked into 9i clob manual (wow, whole manual for one built-in datatype!) and still have a question: What index technique can be used to make effective clob comparison in a query like this:

select * from my_table where clob_column = another_clob;

I noticed that whenever I create a table with clob a unique index is created as well. Is it a "clob" identity index?

and we said...

You cannot index a CLOB like that. A clob is potentionally HUGE -- indexes don't work on HUGE things.

We can index the CONTENTS of the clob using Oracle text (aka interMedia).

One method you can use would be to create an index on the checksum of the first 32k of the clob. this checksum is "small". We can use that to effectively reduce the NUMBER of clobs we need to compare to. We can index the checksum of the CLOB and at runtime use a predicate like:

where checksum(clob_column) = :checksum_of_thing_to_look_for
and clob_column = another_clob


For example:

ops$tkyte@ORA9I.WORLD> create table t ( x int primary key, y clob );

Table created.

ops$tkyte@ORA9I.WORLD> create index t_idx on t(checksum(y));

Index created.

ops$tkyte@ORA9I.WORLD>
ops$tkyte@ORA9I.WORLD> insert into t values (1,'hello world' );

1 row created.

ops$tkyte@ORA9I.WORLD> insert into t values (2,'goodbye' );

1 row created.

ops$tkyte@ORA9I.WORLD> insert into t values (3,'this is just a test...' );

1 row created.

ops$tkyte@ORA9I.WORLD>
ops$tkyte@ORA9I.WORLD> set autotrace on
ops$tkyte@ORA9I.WORLD> select t.*
2 from t
3 where checksum(y) = checksum('hello world')
4 and to_char(y) = 'hello world'
5 /

X Y
---------- --------------------------------------------------------------------------------
1 hello world


Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=1 Card=1 Bytes=2015)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'T' (Cost=1 Card=1 Bytes=2015)
2 1 INDEX (RANGE SCAN) OF 'T_IDX' (NON-UNIQUE) (Cost=1 Card=1)


Where I used:

ops$tkyte@ORA9I.WORLD> create or replace function checksum_str( p_str in varchar2 ) return number
2 as
3 l_checksum varchar2(32);
4 l_raw raw(16);
5 begin
6
7 l_checksum :=
8 dbms_obfuscation_toolkit.md5( input_string=> p_str );
9
10 l_checksum := utl_raw.cast_to_raw( l_checksum );
11
12 return to_number( substr(l_checksum,1,16), lpad('x',33,'x') )+
13 to_number( substr(l_checksum,17), lpad('x',33,'x') );
14
15 end;
16 /

Function created.

ops$tkyte@ORA9I.WORLD> create or replace function checksum( p_lob in clob ) return number
2 DETERMINISTIC
3 as
4 l_checksum varchar2(32);
5 l_raw raw(16);
6 begin
7
8 l_checksum :=
9 dbms_obfuscation_toolkit.md5( input_string=> dbms_lob.substr(p_lob,32000,1) );
10
11 l_checksum := utl_raw.cast_to_raw( l_checksum );
12
13 return to_number( substr(l_checksum,1,16), lpad('x',33,'x') )+
14 to_number( substr(l_checksum,17), lpad('x',33,'x') );
15
16 end;
17 /

Function created.

as simple checksum functions.

and you rated our response

  (8 ratings)

Is this answer out of date? If it is, please let us know via a Review

Reviews

cool!

September 27, 2001 - 1:01 pm UTC

Reviewer: Mikito Harakiri

BTW, I was trying a similar thing

CREATE INDEX my_idx ON my_table (SUBSTR(clob_column,1,100));

and it didn't work


explicit hash function in the equality predicate

September 27, 2001 - 1:07 pm UTC

Reviewer: Mikito Harakiri

On pessimistic note, the ideal solution would avoid explicitly mentioning hash function in the equality predicate -- it is physical access implementation detail after all. I guess this is too much to ask from the optimizer today...

no equality predicate on clobs

September 27, 2001 - 9:50 pm UTC

Reviewer: Mikito Harakiri

Hmm... It took me a while to notice to_char function in your predicate.

There seems no equality predicate on clobs to begin with. CLOB manual says:

Non-Supported SQL Functionality for LOBs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following SQL functionality is not supported for LOBs because the functions are either infrequently used or have easy workarounds.
...
Operators: >, <, =, !=



Tom Kyte

Followup  

September 28, 2001 - 6:58 am UTC

well, you seem to be getting less and less "happy" with this answer.

The fact is CLOBS are big, huge, not really things that you want to be doing equals on -- you have to force it in some cases.

I would question why you need to do this -- compare clobs -- it will be slow, the tips here will help you make it faster, more realistic....

no equality predicate on clobs

September 28, 2001 - 1:37 pm UTC

Reviewer: Mikito Harakiri

I'm collecting sql statements from huge ERP application. Most of the statements are reasonably short, while the longest one is 30K bytes long. I need to be able to identify identical statements, as I also would like to peek into V$SQL. Therefore, I'm not comparing huge clobs at all.

Even if I would, knowing that todays database limits are somewhere under 1B records, we can think of hash function that can make this comparison efficient. This hash heed to have a range > 1B*1B in order to eleminate chances of collision.

On the other hand, you are probably correct, naive literal comparison of clobs woldn't get me too far. In my example, ignoring differences in the names of bind variables would probably be better comparison criteria. Looks like for big objects only user-defined equality makes sence.





Another way of computing a checksum for anysize clob

March 17, 2003 - 5:11 pm UTC

Reviewer: Yves Azie

create or replace and compile
java source named "GetChecksum"
as
import java.io.*;
import java.lang.*;
import java.security.*;

/**
* This class takes in a clob, and computes
* a unique hash value based on the md5 hash algorythm
*/
public class GetChecksum extends Object
{
public static float get(oracle.sql.CLOB clobToCompute) throws Exception
{
BufferedReader clobReader = new BufferedReader(clobToCompute.getCharacterStream());
String s = null;
StringBuffer sb = new StringBuffer();
while ((s = clobReader.readLine()) != null) {
sb.append(s);
}
clobReader.close();

byte[] clobBytes = sb.toString().getBytes();
MessageDigest md5 = MessageDigest.getInstance("MD5");
byte[] clobResult = md5.digest(clobBytes);

ByteArrayInputStream bytestream1 = new ByteArrayInputStream (clobResult);
DataInputStream bytestream2 = new DataInputStream( bytestream1 );

float clobFloat = bytestream2.readFloat();
return clobFloat;
}
}
/
show err

create or replace function checksum ( p_lob in clob )
return number
DETERMINISTIC
as
language java
name 'GetChecksum.get(oracle.sql.CLOB) return float';
/
show err


You may then use this function in SQL as follows:
select checksum(clobfield) from clobTable;

Comparing CLOBs

June 22, 2005 - 5:18 pm UTC

Reviewer: Anirudh from NJ, USA

Hi Tom,

We have this requirement of comparing two CLOBS or rather do a stuff which is similar to "Track Changes" in Microsoft Word. We would like to have a program which would compare two CLOBs and return the differences (line no., word no. or whatever).
I figure that there is no simple and easy way to go about it, can you please suggest an approach which would help us in doing this job.
Again, our requirement is just to compare two CLOBS and point out the differences to the users.

Please feel free to ignore this if you feel this question is irrelevant to the context of the parent topic.

Thanks as always.
Anirudh



Tom Kyte

Followup  

June 23, 2005 - 1:51 pm UTC

there is no simple way to do this, you'd have to implement it or find an implementation of it...

MD5 checksum on a clob

August 11, 2005 - 9:41 am UTC

Reviewer: Carsten RĂ¼thel from BY Germany

Hi,

we have modified Yves Azie's code to get "real" MD5 in a 32 digit hex code.

create or replace and compile
java source named "GetChecksum"
as
import java.io.*;
import java.lang.*;
import java.security.*;

/**
* This class takes in a clob, and computes
* a unique hash value based on the md5 hash algorythm
*/
public class GetChecksum extends Object
{

public static String byteToHex(byte b){
int i = b & 0xFF;
String res = Integer.toHexString(i);
if ( res.length() == 0 ) return ( "00" );
if ( res.length() == 1 ) return ( "0"+res );
return ( res );
}

public static String byteArr2Hex( byte[] b ) {
String res = new String();
for ( int ix=0; ix < b.length; ++ix ) {
res += byteToHex(b[ix]);
}
return ( res );
}

public static String get(oracle.sql.CLOB clobToCompute) throws Exception
{
BufferedReader clobReader = new BufferedReader(clobToCompute.getCharacterStream());
String s = null;
StringBuffer sb = new StringBuffer();
while ((s = clobReader.readLine()) != null) {
sb.append(s);
}
clobReader.close();

byte[] clobBytes = sb.toString().getBytes();
MessageDigest md5 = MessageDigest.getInstance("MD5");
byte[] clobResult = md5.digest(clobBytes);

return ( byteArr2Hex ( clobResult ));
}
}


Is there a mistake in the checksum function ?

March 24, 2020 - 6:39 am UTC

Reviewer: Neil Smith from UK

Is there a mistake ?

You've declared a variable l_raw which isn't used, and then there's this line:
    l_checksum := utl_raw.cast_to_raw( l_checksum );

that will cast it to raw and then implicitly cast it back to varchar2 as it's being assigned back to the l_checksum variable. i.e. do nothing.

should that be
    l_raw := utl_raw.cast_to_raw( l_checksum );

and then use l_raw in place of l_checksum afterwards, or was l_raw and the cast_to_raw not required at all, or am I missing some subtle effect of the cst_to_raw?
Connor McDonald

Followup  

March 25, 2020 - 1:51 am UTC

Not really a mistake just a redundant variable and a touch of laziness.

l_checksum := utl_raw.cast_to_raw( l_checksum );

will

- convert the checksum to a raw
- then due to the datatype mismatch, convert that raw to a char to store in l_checksum

So the net result is the 32byte char represetnation of the 16byte raw being stored.

"l_raw" can be dispensed with

But nice catch!