Skip to Main Content
  • Questions
  • 19c Oracle Text Index(CTXSYS.CONTEXT ) - String search anomaly

Breadcrumb

Question and Answer

Chris Saxon

Thanks for the question, Sabari.

Asked: March 03, 2020 - 12:48 pm UTC

Last updated: March 21, 2023 - 5:46 pm UTC

Version: Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Viewed 10K+ times! This question is

You Asked

We have a table with a blob column that stores the blob of various files like PDF, TIF, WORD, etc and the blob field has Oracle text index INDEXTYPE IS CTXSYS.CONTEXT for us to be able to search for text within the blob’s. We are seeing that the search query using the CONTAINS is giving unexpected results.

Below are the steps we followed:
1) Create the table with the CTXSYS.CONTEXT index on the blob field
2) Populate the table with one blob (PDF) which is OCR’d and has words like ‘executed’, ‘correspondence’, ‘purchase’, ‘original’, ‘marathon’ BUT doesn’t have the word ‘w2’
3) Execute ctx_ddl.sync_index to sync the index.
4) Run the search query.

The search query doesn't find the words that are in the PDF but finds words that are NOT in the PDF. Why would the search results be different?

The same works as expected in Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production. I have compared all the stopwords between 19 vs 12.1.0.2 and couldn't any anything different.

Thanks,

Below are the scripts.
======================================================================
--- TABLE SCRIPT
CREATE TABLE OTXT_ORC_IMAGES_TGT_TEST4
(
  ISN                   NUMBER                  NOT NULL,
  DOCUMENT_BLOB         BLOB,
  DOCUMENT_BLOB_FORMAT  VARCHAR2(30 BYTE)
);

CREATE INDEX DOC_IMG_BLOB_TGT_TEST4 ON OTXT_ORC_IMAGES_TGT_TEST4
(DOCUMENT_BLOB)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS(' DATASTORE CTXSYS.DEFAULT_DATASTORE
FORMAT COLUMN DOCUMENT_BLOB_FORMAT
NOPOPULATE ');

CREATE UNIQUE INDEX PK_OTXT_ORC_IMAGES_TGT_TEST4 ON OTXT_ORC_IMAGES_TGT_TEST4
(ISN);

ALTER TABLE OTXT_ORC_IMAGES_TGT_TEST4 ADD (
  CONSTRAINT PK_OTXT_ORC_IMAGES_TGT_TEST4
  PRIMARY KEY
  (ISN)
  USING INDEX PK_OTXT_ORC_IMAGES_TGT_TEST4
  ENABLE VALIDATE);

======================================================================
-- INSERT THE BLOB INTO THE TABLE
---- I DON'T SEE AN OPTION FOR ME TO UPLOAD THE TEST FILE HERE
---- PLEASE LET ME KNOW HOW I COULD DO IT.
INSERT INTO OTXT_ORC_IMAGES_TGT_TEST4 
   (ISN, DOCUMENT_BLOB_FORMAT,DOCUMENT_BLOB)
 Values
   (29035, 'BINARY', '&BLOB');
COMMIT;

======================================================================

-- INDEX OPTIMIZE
BEGIN
    ctx_ddl.sync_index ('DOC_IMG_BLOB_TGT_TEST4'
                       ,'500M'
                       ,NULL
                       ,4);
END;
/

======================================================================

-- SEARCH QUERY
SELECT   COUNT (1)
  FROM   OTXT_ORC_IMAGES_TGT_TEST4 i;


COUNT(1)
----------
1
1 row selected.

-- The following query returns 1 but the file doesn’t have the word ‘w2’
SELECT   COUNT (1)
  FROM   OTXT_ORC_IMAGES_TGT_TEST4 i
WHERE   1 = 1 AND isn = 29035 AND CONTAINS (i.document_blob, 'w2') > 0;


COUNT(1)
----------
1
1 row selected.


-- The following query returns 0 but the file has all the search words
SELECT   COUNT (1)
 FROM   OTXT_ORC_IMAGES_TGT_TEST4 i
WHERE       1 = 1
         AND isn = 29035
         AND CONTAINS (
                 i.document_blob
                ,'marathon or executed or purchase or original or correspondence') >
             0;

 


COUNT(1)
----------
0
1 row selected.

========================================
03/04/2020 6:15AM CST:
As requested i have emailed the test document to asktom_us@oracle.com with the subject "Question 9542416600346626815"
Thanks.

and Connor said...

Can you please email the document to asktom_us@oracle.com with the subject "Question 9542416600346626815".

======================

Thanks for the file. I can't reproduce the problem with your script and your file...

SQL> select banner from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

1 row selected.

SQL> CREATE TABLE OTXT_ORC_IMAGES_TGT_TEST4
  2  (
  3    ISN                   NUMBER                  NOT NULL,
  4    DOCUMENT_BLOB         BLOB,
  5    DOCUMENT_BLOB_FORMAT  VARCHAR2(30 BYTE)
  6  );

Table created.

SQL>
SQL>
SQL> CREATE INDEX DOC_IMG_BLOB_TGT_TEST4 ON OTXT_ORC_IMAGES_TGT_TEST4
  2  (DOCUMENT_BLOB)
  3  INDEXTYPE IS CTXSYS.CONTEXT
  4  PARAMETERS(' DATASTORE CTXSYS.DEFAULT_DATASTORE
  5  FORMAT COLUMN DOCUMENT_BLOB_FORMAT NOPOPULATE');

Index created.

SQL>
SQL> CREATE UNIQUE INDEX PK_OTXT_ORC_IMAGES_TGT_TEST4 ON OTXT_ORC_IMAGES_TGT_TEST4
  2  (ISN);

Index created.

SQL>
SQL> ALTER TABLE OTXT_ORC_IMAGES_TGT_TEST4 ADD (
  2    CONSTRAINT PK_OTXT_ORC_IMAGES_TGT_TEST4
  3    PRIMARY KEY
  4    (ISN)
  5    USING INDEX PK_OTXT_ORC_IMAGES_TGT_TEST4
  6    ENABLE VALIDATE);

Table altered.

SQL>
SQL>
SQL> declare
  2    l_tgt_idx int := 1;
  3    l_src_idx int := 1;
  4    l_bf  bfile;
  5    l_blob   blob;
  6  begin
  7    insert into otxt_orc_images_tgt_test4
  8     (isn, document_blob_format,document_blob)
  9    values
 10     (29035, 'BINARY', empty_blob())
 11    returning document_blob into l_blob;
 12
 13    l_bf := bfilename('TEMP', '29035.pdf');
 14    dbms_lob.fileopen(l_bf, dbms_lob.file_readonly);
 15    dbms_lob.loadblobfromfile (
 16      dest_lob    => l_blob,
 17      src_bfile   => l_bf,
 18      amount      => dbms_lob.lobmaxsize,
 19      dest_offset => l_tgt_idx,
 20      src_offset  => l_src_idx);
 21    dbms_lob.fileclose(l_bf);
 22
 23    commit;
 24  end;
 25  /

PL/SQL procedure successfully completed.

SQL>
SQL>
SQL> BEGIN
  2      ctx_ddl.sync_index ('DOC_IMG_BLOB_TGT_TEST4'
  3                         ,'500M'
  4                         ,NULL
  5                         ,4);
  6  END;
  7  /

PL/SQL procedure successfully completed.

SQL>
SQL>
SQL> SELECT   COUNT (1)
  2    FROM   OTXT_ORC_IMAGES_TGT_TEST4 i
  3  WHERE   1 = 1 AND isn = 29035 AND CONTAINS (i.document_blob, 'w2') > 0;

  COUNT(1)
----------
         0

1 row selected.

SQL>
SQL>
SQL> SELECT   COUNT (1)
  2   FROM   OTXT_ORC_IMAGES_TGT_TEST4 i
  3  WHERE       1 = 1
  4           AND isn = 29035
  5           AND CONTAINS (
  6                   i.document_blob
  7                  ,'marathon or executed or purchase or original or correspondence') > 0;

  COUNT(1)
----------
         1

1 row selected.

SQL>
SQL>


So a couple of things to check

1) Make sure you are loading the blob correctly
2) Enable a trace on the index creation. See the post below

https://blogs.oracle.com/searchtech/oracle-text-indexes-new-logging-and-tracing-methods-in-12cr2

and see if you can derive anything from there.

Hope this helps.

Rating

  (2 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Link out of date

Brian Diehl, March 20, 2023 - 8:34 pm UTC

I was unable to find the referenced blog post.
https://blogs.oracle.com/searchtech/oracle-text-indexes-new-logging-and-tracing-methods-in-12cr2

Nothing is found.
Chris Saxon
March 21, 2023 - 5:46 pm UTC

Sadly that blog has been decommissioned and the post removed :(

More to Explore

Design

New to good database design? Check out Chris Saxon's full fundamentals class.