I would like to be able to roll up a word count (for a WordCloud control in a UI) based on a query ( a spatial query in this case).
I have a table as such
create table items (id raw(16), text clob, geom sdo_geometry )
This table has a full text index (context) on the TEXT column, and a spatial index.
The database is 12.2, and the index is in "Real time mode."
What I was hoping to do was leverage the CONTEXT index's TOKENs tables (DR$ITEMS_NAME_IDX$<?> in a join like this to provide word count, rather than rolling my own string splitter etc.
sum(token_count) as wordcount,
TOKEN_TEXT as word
from DR$ITEMS_NAME_IDX$k a
inner join items b on a.textkey = b.rowid
inner join ???
group by TOKEN_TEXT;
Is something like this possible? Or should I proceed to write a function that splits the string to table?
Also I noticed there is a Blob column that stores token information, perhaps there is a way to get data out of this for what I need?
Thanks for your help
Well, you don't need to use Oracle Text indexes to do this at all!
Using standard SQL there are various words-to-rows techniques. For example, if you have simple, single spaced text, you can find the number of words by subtracting the length of the text from itself with the spaces removed. And adding one.
You can then use the "connect by level" row generation trick to convert these to rows and count them up. The introduction of lateral in 12c made this a easier:
create table t (
insert into t values ('some text more text extra text some more text');
trim(regexp_substr(t.x, '[[:alpha:]]+', 1, words.l)) word,
select level l from dual
connect by level <= length (t.x) - length(replace(t.x, ' ')) + 1
group by trim(regexp_substr(t.x, '[[:alpha:]]+', 1, words.l))
order by count(*) desc;
Of course, this will find all the words. One advantage of Oracle Text Oracle Text is you can use a stoplist to exclude common words (as, and, of, etc.)
But again, you don't actually need to create a text index to do this. You can pass the string you want to tokenize to ctx_doc.policy_tokens. This splits the string up to words.
All you need to do then is aggregate the results:
'some text more text extra text some more text',rtab);
for words in (
select token, count(*) c from table(rtab)
group by token
order by count(*) desc
dbms_output.put_line(words.token || ' count = ' || words.c);
TEXT count = 4
MORE count = 2
EXTRA count = 1
Some is in the default stoplist. So it's no longer included in the output.