We have a table with a blob column that stores the blob of various files like PDF, TIF, WORD, etc and the blob field has Oracle text index INDEXTYPE IS CTXSYS.CONTEXT for us to be able to search for text within the blob’s. We are seeing that the search query using the CONTAINS is giving unexpected results.
Below are the steps we followed:
1) Create the table with the CTXSYS.CONTEXT index on the blob field
2) Populate the table with one blob (PDF) which is OCR’d and has words like ‘executed’, ‘correspondence’, ‘purchase’, ‘original’, ‘marathon’ BUT doesn’t have the word ‘w2’
3) Execute ctx_ddl.sync_index to sync the index.
4) Run the search query.
The search query doesn't find the words that are in the PDF but finds words that are NOT in the PDF. Why would the search results be different?
The same works as expected in Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production. I have compared all the stopwords between 19 vs 12.1.0.2 and couldn't any anything different.
Thanks,
Below are the scripts.
======================================================================
--- TABLE SCRIPT
CREATE TABLE OTXT_ORC_IMAGES_TGT_TEST4
(
ISN NUMBER NOT NULL,
DOCUMENT_BLOB BLOB,
DOCUMENT_BLOB_FORMAT VARCHAR2(30 BYTE)
);
CREATE INDEX DOC_IMG_BLOB_TGT_TEST4 ON OTXT_ORC_IMAGES_TGT_TEST4
(DOCUMENT_BLOB)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS(' DATASTORE CTXSYS.DEFAULT_DATASTORE
FORMAT COLUMN DOCUMENT_BLOB_FORMAT
NOPOPULATE ');
CREATE UNIQUE INDEX PK_OTXT_ORC_IMAGES_TGT_TEST4 ON OTXT_ORC_IMAGES_TGT_TEST4
(ISN);
ALTER TABLE OTXT_ORC_IMAGES_TGT_TEST4 ADD (
CONSTRAINT PK_OTXT_ORC_IMAGES_TGT_TEST4
PRIMARY KEY
(ISN)
USING INDEX PK_OTXT_ORC_IMAGES_TGT_TEST4
ENABLE VALIDATE);
======================================================================
-- INSERT THE BLOB INTO THE TABLE
---- I DON'T SEE AN OPTION FOR ME TO UPLOAD THE TEST FILE HERE
---- PLEASE LET ME KNOW HOW I COULD DO IT.
INSERT INTO OTXT_ORC_IMAGES_TGT_TEST4
(ISN, DOCUMENT_BLOB_FORMAT,DOCUMENT_BLOB)
Values
(29035, 'BINARY', '&BLOB');
COMMIT;
======================================================================
-- INDEX OPTIMIZE
BEGIN
ctx_ddl.sync_index ('DOC_IMG_BLOB_TGT_TEST4'
,'500M'
,NULL
,4);
END;
/
======================================================================
-- SEARCH QUERY
SELECT COUNT (1)
FROM OTXT_ORC_IMAGES_TGT_TEST4 i;
COUNT(1)
----------
1
1 row selected.
-- The following query returns 1 but the file doesn’t have the word ‘w2’
SELECT COUNT (1)
FROM OTXT_ORC_IMAGES_TGT_TEST4 i
WHERE 1 = 1 AND isn = 29035 AND CONTAINS (i.document_blob, 'w2') > 0;
COUNT(1)
----------
1
1 row selected.
-- The following query returns 0 but the file has all the search words
SELECT COUNT (1)
FROM OTXT_ORC_IMAGES_TGT_TEST4 i
WHERE 1 = 1
AND isn = 29035
AND CONTAINS (
i.document_blob
,'marathon or executed or purchase or original or correspondence') >
0;
COUNT(1)
----------
0
1 row selected.
========================================
03/04/2020 6:15AM CST:
As requested i have emailed the test document to asktom_us@oracle.com with the subject "Question 9542416600346626815"
Thanks.
Can you please email the document to asktom_us@oracle.com with the subject "Question 9542416600346626815".
======================
Thanks for the file. I can't reproduce the problem with your script and your file...
SQL> select banner from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
1 row selected.
SQL> CREATE TABLE OTXT_ORC_IMAGES_TGT_TEST4
2 (
3 ISN NUMBER NOT NULL,
4 DOCUMENT_BLOB BLOB,
5 DOCUMENT_BLOB_FORMAT VARCHAR2(30 BYTE)
6 );
Table created.
SQL>
SQL>
SQL> CREATE INDEX DOC_IMG_BLOB_TGT_TEST4 ON OTXT_ORC_IMAGES_TGT_TEST4
2 (DOCUMENT_BLOB)
3 INDEXTYPE IS CTXSYS.CONTEXT
4 PARAMETERS(' DATASTORE CTXSYS.DEFAULT_DATASTORE
5 FORMAT COLUMN DOCUMENT_BLOB_FORMAT NOPOPULATE');
Index created.
SQL>
SQL> CREATE UNIQUE INDEX PK_OTXT_ORC_IMAGES_TGT_TEST4 ON OTXT_ORC_IMAGES_TGT_TEST4
2 (ISN);
Index created.
SQL>
SQL> ALTER TABLE OTXT_ORC_IMAGES_TGT_TEST4 ADD (
2 CONSTRAINT PK_OTXT_ORC_IMAGES_TGT_TEST4
3 PRIMARY KEY
4 (ISN)
5 USING INDEX PK_OTXT_ORC_IMAGES_TGT_TEST4
6 ENABLE VALIDATE);
Table altered.
SQL>
SQL>
SQL> declare
2 l_tgt_idx int := 1;
3 l_src_idx int := 1;
4 l_bf bfile;
5 l_blob blob;
6 begin
7 insert into otxt_orc_images_tgt_test4
8 (isn, document_blob_format,document_blob)
9 values
10 (29035, 'BINARY', empty_blob())
11 returning document_blob into l_blob;
12
13 l_bf := bfilename('TEMP', '29035.pdf');
14 dbms_lob.fileopen(l_bf, dbms_lob.file_readonly);
15 dbms_lob.loadblobfromfile (
16 dest_lob => l_blob,
17 src_bfile => l_bf,
18 amount => dbms_lob.lobmaxsize,
19 dest_offset => l_tgt_idx,
20 src_offset => l_src_idx);
21 dbms_lob.fileclose(l_bf);
22
23 commit;
24 end;
25 /
PL/SQL procedure successfully completed.
SQL>
SQL>
SQL> BEGIN
2 ctx_ddl.sync_index ('DOC_IMG_BLOB_TGT_TEST4'
3 ,'500M'
4 ,NULL
5 ,4);
6 END;
7 /
PL/SQL procedure successfully completed.
SQL>
SQL>
SQL> SELECT COUNT (1)
2 FROM OTXT_ORC_IMAGES_TGT_TEST4 i
3 WHERE 1 = 1 AND isn = 29035 AND CONTAINS (i.document_blob, 'w2') > 0;
COUNT(1)
----------
0
1 row selected.
SQL>
SQL>
SQL> SELECT COUNT (1)
2 FROM OTXT_ORC_IMAGES_TGT_TEST4 i
3 WHERE 1 = 1
4 AND isn = 29035
5 AND CONTAINS (
6 i.document_blob
7 ,'marathon or executed or purchase or original or correspondence') > 0;
COUNT(1)
----------
1
1 row selected.
SQL>
SQL>
So a couple of things to check
1) Make sure you are loading the blob correctly
2) Enable a trace on the index creation. See the post below
https://blogs.oracle.com/searchtech/oracle-text-indexes-new-logging-and-tracing-methods-in-12cr2 and see if you can derive anything from there.
Hope this helps.