Dear Tom,
we are using Oracle Text to compare string values and observe some surprising results, which seem to depend on the definition of the index. Please have a look a the following toy example:
CREATE TABLE tmp (itemnum NUMBER, description VARCHAR2(30));
INSERT ALL
INTO tmp VALUES (1, 'test')
INTO tmp VALUES (2, 'in vitro')
INTO tmp VALUES (3, 'novitro')
INTO tmp VALUES (4, 'invitro')
INTO tmp VALUES (5, 'in-vitro')
SELECT * FROM dual;
CREATE INDEX idx_tmp ON tmp (description) INDEXTYPE IS ctxsys.CONTEXT;
SELECT itemnum, SCORE(1), description FROM tmp WHERE CONTAINS (description,'$INDIANA', 1) > 0;
As expected 'INDIANA' is not related to any string in the table. Now we define the stemmer values of the index:
DROP INDEX idx_tmp;
BEGIN
ctx_ddl.drop_preference ('kw_lexer');
ctx_ddl.create_preference('kw_lexer', 'BASIC_LEXER');
ctx_ddl.set_attribute ('kw_lexer', 'index_themes', 'NO');
ctx_ddl.set_attribute ('kw_lexer', 'index_text', 'YES');
ctx_ddl.set_attribute('kw_lexer', 'index_stems', 'ENGLISH');
ctx_ddl.drop_preference ('kw_wordlist');
ctx_ddl.create_preference('kw_wordlist', 'BASIC_WORDLIST');
ctx_ddl.set_attribute('kw_wordlist', 'stemmer', 'ENGLISH'); END;
CREATE INDEX idx_tmp ON tmp (description) INDEXTYPE IS ctxsys.CONTEXT PARAMETERS ('LEXER kw_lexer WORDLIST kw_wordlist');
SELECT itemnum, SCORE(1), description FROM tmp WHERE CONTAINS (description,'$INDIANA', 1) > 0;
On our server the last query returns to our surprise two values: 'in vitro' and 'in-vitro'. Sure, both contain the 'in' string, but does that 'in' string really denote the root of 'INDIANA' and 'invitro'? And if so, why doesn't the default stemmer picks that up? Or, running the server in Germany, could maybe some predefined (language) parameters explain this query result?
Any help and insight would be appreciated.
Kind regads,
Stephan
I've asked Roger Ford, Oracle Text PM, to answer this. It seems the problem relates to US state abbreviations...
1. "IN" and "indiana" are considered to share the same stem (even though "Middx" and "Middlesex" don't - but nobody's surprised by US bias in such things). The actual "stem", in this case, is "Indiana".
2. "in" is in the default English stoplist
3. "in-vitro" is lexed as two words, "in" and "vitro" with default lexer settings.
The results below match "in vitro" and "in-vitro" as these both contain the word "in". However, they (correctly) do not match "invitro"
If we use the EMPTY_STOPLIST, as in the attached script, then both queries return the same result.
So ... using the original example with the default stoplist:
In the first example, "in" is not indexed. The stem search for "$indiana" is expanded to "in=indiana" - an equivalence search which will find either word. It doesn't match anything because "in" is not in the index (because it's a stopword) and "indiana" isn't in there either.
In the second example, we've chosen to index stems. We come across "in" and store its stem - "indiana" with TOKEN_TYPE=9 for a stem. Now this time, the search knows that stems are indexed, so when we search for "Indiana" it simply looks up "indiana" with TOKEN_TYPE=9 and finds it.
It could be argued that stemming should be done AFTER stopword removal - in which case we wouldn't get this match. But neither would we ever get a match on the state abbreviation.
And here's the script he refers to:
CREATE TABLE tmp (itemnum NUMBER, description VARCHAR2(30));
INSERT ALL
INTO tmp VALUES (1, 'test')
INTO tmp VALUES (2, 'in vitro')
INTO tmp VALUES (3, 'novitro')
INTO tmp VALUES (4, 'invitro')
INTO tmp VALUES (5, 'in-vitro')
INTO tmp VALUES (6, 'at vitro')
SELECT * FROM dual;
CREATE INDEX idx_tmp ON tmp (description) INDEXTYPE IS ctxsys.CONTEXT parameters ('stoplist ctxsys.empty_stoplist');
SELECT itemnum, SCORE(1), description FROM tmp WHERE CONTAINS
(description,'$INDIANA', 1) > 0;
ITEMNUM SCORE(1) DESCRIPTION
2 4 in vitro
5 4 in-vitro
DROP INDEX idx_tmp;
exec ctx_ddl.drop_preference ('kw_lexer')
exec ctx_ddl.drop_preference ('kw_wordlist')
BEGIN
ctx_ddl.create_preference('kw_lexer', 'BASIC_LEXER');
ctx_ddl.set_attribute ('kw_lexer', 'index_themes', 'NO');
ctx_ddl.set_attribute ('kw_lexer', 'index_text', 'YES');
ctx_ddl.set_attribute('kw_lexer', 'index_stems', 'ENGLISH');
ctx_ddl.create_preference('kw_wordlist', 'BASIC_WORDLIST');
ctx_ddl.set_attribute('kw_wordlist', 'stemmer', 'ENGLISH');
END;
/
CREATE INDEX idx_tmp ON tmp (description) INDEXTYPE IS ctxsys.CONTEXT
PARAMETERS ('LEXER kw_lexer WORDLIST kw_wordlist stoplist ctxsys.empty_stoplist');
select token_text, token_type from dr$idx_tmp$i;
TOKEN_TEXT TOKEN_TYPE
AT 0
IN 0
INDIANA 9
INVITRO 0
NOVITRO 0
TEST 0
VITRO 0
SELECT itemnum, SCORE(1), description FROM tmp WHERE CONTAINS
(description,'$INDIANA', 1) > 0;
ITEMNUM SCORE(1) DESCRIPTION
2 4 in vitro
5 4 in-vitro