Skip to Main Content
  • Questions
  • Defining stemming parameters results in unexpected query outcome in Oracle Text

Breadcrumb

Question and Answer

Chris Saxon

Thanks for the question.

Asked: November 08, 2016 - 8:29 am UTC

Last updated: November 10, 2016 - 12:40 pm UTC

Version: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production

Viewed 1000+ times

You Asked

Dear Tom,

we are using Oracle Text to compare string values and observe some surprising results, which seem to depend on the definition of the index. Please have a look a the following toy example:

CREATE TABLE tmp (itemnum NUMBER, description VARCHAR2(30));

INSERT ALL
  INTO tmp VALUES (1, 'test')
  INTO tmp VALUES (2, 'in vitro')
  INTO tmp VALUES (3, 'novitro')
  INTO tmp VALUES (4, 'invitro')
  INTO tmp VALUES (5, 'in-vitro')
SELECT * FROM dual;

CREATE INDEX idx_tmp ON tmp (description) INDEXTYPE IS ctxsys.CONTEXT;

SELECT itemnum, SCORE(1), description FROM tmp WHERE CONTAINS (description,'$INDIANA', 1) > 0;


As expected 'INDIANA' is not related to any string in the table. Now we define the stemmer values of the index:

DROP INDEX idx_tmp;

BEGIN
  ctx_ddl.drop_preference ('kw_lexer');
  ctx_ddl.create_preference('kw_lexer', 'BASIC_LEXER');
  ctx_ddl.set_attribute ('kw_lexer', 'index_themes', 'NO');
  ctx_ddl.set_attribute ('kw_lexer', 'index_text', 'YES');
  ctx_ddl.set_attribute('kw_lexer', 'index_stems', 'ENGLISH');
  ctx_ddl.drop_preference ('kw_wordlist');
  ctx_ddl.create_preference('kw_wordlist', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('kw_wordlist', 'stemmer', 'ENGLISH'); END;

CREATE INDEX idx_tmp ON tmp (description) INDEXTYPE IS ctxsys.CONTEXT PARAMETERS ('LEXER kw_lexer WORDLIST kw_wordlist');

SELECT itemnum, SCORE(1), description FROM tmp WHERE CONTAINS (description,'$INDIANA', 1) > 0;


On our server the last query returns to our surprise two values: 'in vitro' and 'in-vitro'. Sure, both contain the 'in' string, but does that 'in' string really denote the root of 'INDIANA' and 'invitro'? And if so, why doesn't the default stemmer picks that up? Or, running the server in Germany, could maybe some predefined (language) parameters explain this query result?

Any help and insight would be appreciated.

Kind regads,
Stephan

and Chris said...

I've asked Roger Ford, Oracle Text PM, to answer this. It seems the problem relates to US state abbreviations...


1. "IN" and "indiana" are considered to share the same stem (even though "Middx" and "Middlesex" don't - but nobody's surprised by US bias in such things). The actual "stem", in this case, is "Indiana".

2. "in" is in the default English stoplist

3. "in-vitro" is lexed as two words, "in" and "vitro" with default lexer settings.

The results below match "in vitro" and "in-vitro" as these both contain the word "in". However, they (correctly) do not match "invitro"
If we use the EMPTY_STOPLIST, as in the attached script, then both queries return the same result.

So ... using the original example with the default stoplist:

In the first example, "in" is not indexed. The stem search for "$indiana" is expanded to "in=indiana" - an equivalence search which will find either word. It doesn't match anything because "in" is not in the index (because it's a stopword) and "indiana" isn't in there either.

In the second example, we've chosen to index stems. We come across "in" and store its stem - "indiana" with TOKEN_TYPE=9 for a stem. Now this time, the search knows that stems are indexed, so when we search for "Indiana" it simply looks up "indiana" with TOKEN_TYPE=9 and finds it.

It could be argued that stemming should be done AFTER stopword removal - in which case we wouldn't get this match. But neither would we ever get a match on the state abbreviation.


And here's the script he refers to:

CREATE TABLE tmp (itemnum NUMBER, description VARCHAR2(30));

INSERT ALL
  INTO tmp VALUES (1, 'test')
  INTO tmp VALUES (2, 'in vitro')
  INTO tmp VALUES (3, 'novitro')
  INTO tmp VALUES (4, 'invitro')
  INTO tmp VALUES (5, 'in-vitro')
  INTO tmp VALUES (6, 'at vitro')
SELECT * FROM dual;

CREATE INDEX idx_tmp ON tmp (description) INDEXTYPE IS ctxsys.CONTEXT parameters ('stoplist ctxsys.empty_stoplist');

SELECT itemnum, SCORE(1), description FROM tmp WHERE CONTAINS 
(description,'$INDIANA', 1) > 0;

ITEMNUM  SCORE(1)  DESCRIPTION  
2        4         in vitro     
5        4         in-vitro  

DROP INDEX idx_tmp;

exec ctx_ddl.drop_preference ('kw_lexer')
exec ctx_ddl.drop_preference ('kw_wordlist')

BEGIN
  ctx_ddl.create_preference('kw_lexer', 'BASIC_LEXER');
  ctx_ddl.set_attribute ('kw_lexer', 'index_themes', 'NO');
  ctx_ddl.set_attribute ('kw_lexer', 'index_text', 'YES');
  ctx_ddl.set_attribute('kw_lexer', 'index_stems', 'ENGLISH');
  ctx_ddl.create_preference('kw_wordlist', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('kw_wordlist', 'stemmer', 'ENGLISH'); 
END;
/

CREATE INDEX idx_tmp ON tmp (description) INDEXTYPE IS ctxsys.CONTEXT 
PARAMETERS ('LEXER kw_lexer WORDLIST kw_wordlist stoplist ctxsys.empty_stoplist');

select token_text, token_type from dr$idx_tmp$i;

TOKEN_TEXT  TOKEN_TYPE  
AT          0           
IN          0           
INDIANA     9           
INVITRO     0           
NOVITRO     0           
TEST        0           
VITRO       0  

SELECT itemnum, SCORE(1), description FROM tmp WHERE CONTAINS 
(description,'$INDIANA', 1) > 0;

ITEMNUM  SCORE(1)  DESCRIPTION  
2        4         in vitro     
5        4         in-vitro  

Rating

  (1 rating)

Is this answer out of date? If it is, please let us know via a Comment

Comments

A reader, November 10, 2016 - 1:08 pm UTC