Skip to Main Content
  • Questions
  • counting the appearance of a certain word in a document

Breadcrumb

May 4th

Question and Answer

Connor McDonald

Thanks for the question, Ionut.

Asked: March 06, 2017 - 2:24 pm UTC

Last updated: May 05, 2017 - 1:51 am UTC

Version: 11g

Viewed 1000+ times

You Asked

Hello,

I am trying to count the appearance of a certain word into a document (stored as BFILE in database).
Table definition looks like this:
CREATE TABLE MY_DOC
(ID NUMBER PRIMARY KEY,
FILE_NAME VARCHAR2(255),
UPLOAD_DATE VARCHAR2(10),
FILESIZE VARCHAR2(20),
FILETYPE VARCHAR2(5),
CONTENT BFILE,
CREATION_DATE DATE DEFAULT sysdate NOT NULL ENABLE,
MODIFICATION_DATE DATE,
ACCESSED_DATE DATE);

its content is:
ID FILE_NAME UPLOAD_DATE FILESIZE FILETYPE CREATION_DATE MODIFICATION_DATE ACCESSED_DATE CONTENT
242 test.doc 06-MAR-17 37 .doc 23-FEB-17 02-MAR-17 23-FEB-17 (BFILE)
232 htmldoc.html 06-MAR-17 28 .html 06-MAR-17 06-MAR-17 06-MAR-17 (BFILE)
243 text.txt 06-MAR-17 24 .pdf 27-FEB-17 02-MAR-17 27-FEB-17 (BFILE)
223 wordfile.docx 06-MAR-17 1913 .docx 06-MAR-17 06-MAR-17 06-MAR-17 (BFILE)

Each file contains more or less the word 'test'.
I want to create a query to display the file name and how many time the 'test' word appears inside.
Is there a problem if some files are .DOCX or .PDF? Would the query display a correct result?

Can you help me with a hint how to build the query, please?

Thank you very much,
Ionut P.

and Connor said...

You can use the CTX_DOC.TOKENS package to extract the tokens (not including default stopwords), eg

SQL> create table documents (id number primary key, text bfile);

Table created.

SQL> insert into documents values (1, bfilename('MY_DIR','12things_about_122.pdf'));

1 row created.

SQL> create index documents_idx on documents (text) indextype is ctxsys.context;

Index created.

SQL> set serveroutput on size 1000000
SQL> declare
  2   the_tokens ctx_doc.token_tab;
  3  begin
  4   ctx_doc.set_key_type ('PRIMARY_KEY');
  5   ctx_doc.tokens('documents_idx','1',the_tokens);
  6   dbms_output.put_line('Number of tokens: '|| the_tokens.count);
  7  end;
  8  /
Number of tokens: 11452

PL/SQL procedure successfully completed.


Since we actually *extract* all of the tokens, you would possibly want to do this just once at insertion time and store it with the row, rather than pick it up every time it is queried.

Rating

  (3 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

not only once interogation

Ionut Preda, March 07, 2017 - 11:39 am UTC

Hello Connor,

Thank you very much for reply. It is a starting point.

1st, the interrogation will be performed without knowing the exact matching word. It is up to the user; he might want to search for "blablabla" or "tralala" or anything he wants. I guess, I have to create a functions which will return the number of tokens. Am I right?

2nd, I was trying your example but I received "DRG-10001: can not access result table the_tokens" error message.

Should I previously do something else?

Thank you very much,
Ionut P.
Connor McDonald
May 05, 2017 - 1:51 am UTC

Sorry for the delay - I got some good information back from the Text Product Manager

If you are happy with a count limit of 100, you can use a text index and a "query template" which specifies
<score algorithm="count">

SQL> create table t (id int primary key, doc varchar2(2000));

Table created.

SQL>
SQL> -- 1 to 5 occurrences of the word "oracle"
SQL> insert into t values (1, 'oracle');

1 row created.

SQL> insert into t values (2, 'oracle oracle');

1 row created.

SQL> insert into t values (3, 'oracle oracle oracle');

1 row created.

SQL> insert into t values (4, 'oracle oracle oracle oracle');

1 row created.

SQL> insert into t values (5, 'oracle oracle oracle oracle oracle');

1 row created.

SQL>
SQL> -- 110 occurrences of the word "oracle"
SQL> insert into t values (6, 'oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle o
racle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracl
e oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle or
acle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle oracle');

1 row created.

SQL>
SQL> create index t_index on t(doc) indextype is ctxsys.context;

Index created.

SQL>
SQL> select id, score(1) as occurrence_count from t where contains( doc, '
  2  <query>
  3    <textquery>
  4      oracle
  5    </textquery>
  6    <score algorithm="count" />
  7  </query>
  8  ', 1) > 0;

        ID OCCURRENCE_COUNT
---------- ----------------
         1                1
         2                2
         3                3
         4                4
         5                5
         6              100   -- there are actually 110

6 rows selected.


If you need to allow for counts larger than 100, it's probably best to dispense with the text index, and use CTX_DOC.POLICY_HIGHLIGHT. This uses a POLICY (sometimes described as an index definition without an index) and scans each document provided to it to look for search terms. It's designed to provide offset and length for the "hits" within the document, but we can dispense with them just count the number of highlights.
Obviously as this doesn't use an index it will be less efficient, but there is no limit on the count returned.


SQL>
SQL> set serverout on
SQL>
SQL> begin
  2    ctx_ddl.create_policy (
  3       policy_name => 'occurrence_policy'
  4    );
  5  end;
  6  /

PL/SQL procedure successfully completed.

SQL>
SQL> create or replace function occurrence_count
  2     (document varchar2,
  3      search_term varchar2
  4     )
  5  return number is
  6    highlight_tab ctx_doc.highlight_tab;
  7  begin
  8    for c in ( select id, doc from t ) loop
  9      ctx_doc.policy_highlight( 'T_INDEX', document, search_term, highlight_tab);
 10      return highlight_tab.count;
 11    end loop;
 12  end;
 13  /

Function created.

SQL>
SQL> select id, occurrence_count( doc, 'oracle') from T;

        ID OCCURRENCE_COUNT(DOC,'ORACLE')
---------- ------------------------------
         1                              1
         2                              2
         3                              3
         4                              4
         5                              5
         6                            110

6 rows selected.

SQL>


Ionut Preda, March 08, 2017 - 8:56 am UTC

The full error message:

Error report:
ORA-20000: Oracle Text error:
DRG-10001: can not access result table the_tokens
ORA-06512: at "CTXSYS.DRUE", line 160
ORA-06512: at "CTXSYS.CTX_DOC", line 862
ORA-06512: at line 5
20000. 00000 - "%s"
*Cause: The stored procedure 'raise_application_error'
was called which causes this error to be generated.
*Action: Correct the problem as described in the error message or contact
the application administrator or DBA for more information.
Chris Saxon
March 09, 2017 - 1:51 pm UTC

What exactly are you trying? Show us your code!

Ionut Preda, March 10, 2017 - 8:27 am UTC

Hello Chris,

I created a function, based on the code provided by Connor:

create or replace function fnc_count_tokens(p_word varchar2) return number as
the_tokens ctxsys.ctx_doc.token_tab;
begin
ctxsys.ctx_doc.set_key_type('PRIMARY_KEY');
ctxsys.ctx_doc.tokens('documents_idx', '1', the_tokens);
RETURN the_tokens.count;
end ;

I compile this function in "test" schema (other than ctxsys) and try to execute it over the table.
When I do this, I have the error message described above.

Is it necessary extra grants or something else?

Thank you,
Ionut P.

More to Explore

CLOB/BLOB/etc

Complete documentation on Securefiles and Large Objects here