Skip to Main Content
  • Questions
  • Getting word count from context text index by joining to indexed rows

Breadcrumb

Question and Answer

Chris Saxon

Thanks for the question, Mark.

Asked: October 16, 2017 - 9:40 pm UTC

Last updated: October 21, 2021 - 10:51 am UTC

Version: 12.2

Viewed 10K+ times! This question is

You Asked

Hello Tom,
I would like to be able to roll up a word count (for a WordCloud control in a UI) based on a query ( a spatial query in this case).

I have a table as such
create table items (id raw(16), text clob, geom sdo_geometry )


This table has a full text index (context) on the TEXT column, and a spatial index.
The database is 12.2, and the index is in "Real time mode."


What I was hoping to do was leverage the CONTEXT index's TOKENs tables (DR$ITEMS_NAME_IDX$<?> in a join like this to provide word count, rather than rolling my own string splitter etc.
select 
sum(token_count) as wordcount, 
TOKEN_TEXT as word
from DR$ITEMS_NAME_IDX$k a
inner join items b on a.textkey = b.rowid
inner join ??? 
where sdo_relate(b.geom,PARAM_GEOMETRY)='TRUE'
group by TOKEN_TEXT;

Is something like this possible? Or should I proceed to write a function that splits the string to table?

Also I noticed there is a Blob column that stores token information, perhaps there is a way to get data out of this for what I need?

Thanks for your help

and Chris said...

Well, you don't need to use Oracle Text indexes to do this at all!

Using standard SQL there are various words-to-rows techniques. For example, if you have simple, single spaced text, you can find the number of words by subtracting the length of the text from itself with the spaces removed. And adding one.

You can then use the "connect by level" row generation trick to convert these to rows and count them up. The introduction of lateral in 12c made this a easier:

create table t (
  x varchar2(100)
);
insert into t values ('some text more text extra text some more text');
commit;
select 
  trim(regexp_substr(t.x, '[[:alpha:]]+', 1, words.l)) word,
  count(*)
from t,
     lateral (
       select level l from dual 
       connect by level <= length (t.x) - length(replace(t.x, ' ')) + 1 
     ) words
group by trim(regexp_substr(t.x, '[[:alpha:]]+', 1, words.l))
order by count(*) desc;

WORD    COUNT(*)   
text             4 
some             2 
more             2 
extra            1 


Of course, this will find all the words. One advantage of Oracle Text Oracle Text is you can use a stoplist to exclude common words (as, and, of, etc.)

But again, you don't actually need to create a text index to do this. You can pass the string you want to tokenize to ctx_doc.policy_tokens. This splits the string up to words.

All you need to do then is aggregate the results:

exec ctx_ddl.create_policy('mypolicy');
declare
  la     varchar2(200);
  rtab   ctx_doc.token_tab;
begin
   ctx_doc.policy_tokens('mypolicy', 
        'some text more text extra text some more text',rtab);
   for words in (
     select token, count(*) c from table(rtab)
     group  by token
     order  by count(*) desc
   ) loop
     dbms_output.put_line(words.token || ' count = ' || words.c);
   end loop;
end;
/

TEXT count = 4
MORE count = 2
EXTRA count = 1


Some is in the default stoplist. So it's no longer included in the output.

Rating

  (3 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Missing the point of the original question?

mathguy, November 23, 2020 - 4:54 pm UTC

I believe the answer offered does not actually address the question that was asked.

The poster doesn't "need an index" to solve the word count problem. He already has a text index for other purposes - he just wants to know if he can leverage that so that he doesn't have to do all the things you suggested in your answer.

I don't really see an answer to the poster's question. Perhaps the answer is a simple "no, you can't use a text index for that purpose" - I don't know myself, and I think an answer to that specific question would be helpful to others (like myself), not just to the original poster.
Chris Saxon
November 24, 2020 - 6:01 pm UTC

From Roger Ford, Oracle Text PM:

You can get the number of documents that contain a particular token direct from the index:

select sum(token_count) from dr$<index_name>$i where token_text = 'TOKEN'

But you can't get a count of all the actual occurrences of the token (given that a token might be repeated may times in a single document).


So no you can't - you need a workaround like described above.

Amin Adatia, October 14, 2021 - 9:14 pm UTC

If I have a query with CONTAINS(Index_Column,'token1,token2, etc)>0
is it possible to get a count of records for each of the tokens in my list?
Chris Saxon
October 15, 2021 - 5:45 pm UTC

What exactly is it you want?

You can use an expression in COUNT itself, something like this might do the trick:

count ( case when CONTAINS(Index_Column,'token1')>0 then 1 end ),
count ( case when CONTAINS(Index_Column,'token2')>0 then 1 end )

I'm not sure what it is you're asking though

Token Counts

Amin Adatia, October 20, 2021 - 11:04 am UTC

Hi

What I wanted to do was to take the string of tokens passed into the query as in CONTAINS(Column,'token1,token2, token3,...tokenN')>0
to give a result like
Token1 => 20 Documents
Token2 => 5 Documents
etc.
I wrote a function to parse the tokens string to return individual tokens and the passed each of the tokens to do the CONTAINS query.

Don't know if there is a built-in function in Oracle Text to do the same

Chris Saxon
October 21, 2021 - 10:51 am UTC

I'm unclear exactly what you're attempting here - please post this as a new question with a complete test case (create table + insert into + expected output) showing what you're trying to do

More to Explore

Design

New to good database design? Check out Chris Saxon's full fundamentals class.