Skip to Main Content
  • Questions
  • How prefix_index and substring_index improve wildcard searches in Oracle Text

Breadcrumb

Question and Answer

Chris Saxon

Thanks for the question, p.

Asked: July 28, 2020 - 4:20 am UTC

Last updated: August 04, 2020 - 4:09 pm UTC

Version: 11g

Viewed 1000+ times

You Asked

Hi Team,
I'm doing some Oracle Text work related to prefix_index and substring_index (we use Oracle 11g).
I'm interested in why:
(1)Prefix indexing improves performance for right truncated wildcard searches such as TO%
(2)A substring index improves left-truncated and double-truncated wildcard queries such as %ing or %benz%

Could you help to check whether my following understanding is right or wrong? Thanks!
(Maybe the following understanding and questions are not very clear. I just want to know the search logic and process of queries like %abc, abc%, %abc% using prefix_index and substring_index by Oracle Text)

(1) I want to confirm principles of extracting tokens using Oracle Text.
After we extract tokens from text (tokens are stored in table $I), keywords that user inputs to query will be extract into tokens, too. If at least one keyword token is equal to text token, then user can get matched text.
For example, text1 is extracted into token1 and token2 (stored in table $I). Then user inputs some keywords to query. Keywords are extracted into token2 and token3, since token2 is stroed in $I, user can get search result containing text1.
However, if keywords are extracted into token3 and token4. Suppose that token3 is prefix of token2, user still can't get search result containing text1 because token2 != token3.
Am I understanding right or wrong? Thanks.

(2)When we use prefix_index (and set prefix_length_min=2, prefix_length_max=3), some prefix tokens (token_type = 6 ) will be stored in $I. When we use right truncated wildcard searches, for example:
(a) Search to%. Length of "to" is 2. So it will directly search in tokens (only token_type=6) in $I table. If token "to" is in $I and its token_type is 6, then user can get search results matching to%. If token "to" is not in $I, then search result is empty.
But we find that sometimes one prefix token (token_type = 6) is stored in $I table (e.g. this token is "站起"). Then we search 站起%, the search result is empty. Do you know the reason? Thanks.
(b) Search toto%. Length of "toto" is 4 (prefix_legth_max is 3). So it will directly search every token (token_type = 0) in $I table.
Am I understanding right or wrong? Thanks.

(3)When we use substring_index, some substring tokens will be stored in $P. When we use left truncated wildcard searches, for example:
(a) Search %abc, then it will directly search in $P table using column PAT_PART2. If PAT_PART2 has token abc, then user can get search results matching %abc. If PAT_PART2 hasn't token abc, then search result is empty.
(b) Search %a (length of "a" is 1), since length of all tokens in $P table in column PAT_PART2 is larger than 1, so it will directly search every token (token_type = 0) in $I table.
Am I understanding right or wrong? Thanks.
And I want to know that why using substring_index can improves double-truncated wildcard queries like %benz% (I understand substring_index can improve left-truncated wildcard queries, but I want to know the process and principle of querying %benz% by Oracle Text. Could you help to explain it, thanks!)



and Chris said...

I reached out to Roger Ford, PM for Oracle Text to help with this. Here are his responses:

1

I'd take issue with "if it least one keyword token is equal...", it depends very much on the query. If I search for "dog AND cat", then both those tokens must be present. If I search for "dog cat" then that's a phrase search and both words must be present, in order and as consecutive tokens.

> However, if keywords are extracted into token3 and token4. Suppose that token3 is prefix of token2, user still can't get search result containing text1 because token2 != token3.
> Am I understanding right or wrong? Thanks.

I'm struggling to understand your example here. I'm not sure what you're actually searching for. Perhaps you could show some code for your example, or at least use some actual words.

2

Firstly, it's PREFIX_MIN_LENGTH and PREFIX_MAX_LENGTH, rather than what you wrote. Early doc is incorrect on this, 19c doc is correct.

If you do a wildcard search where the search term is shorter than PREFIX_MIN_LENGTH, then it will ignore the prefix tokens (TOKEN_TYPE=6) and do a normal wildcard lookup against normal tokens (TOKEN_TYPE=0).

If you're testing this, don't forget that 'to' is a default wildcard, so it will NOT be indexed if you're using the default English stopword list.


> But we find that sometimes one prefix token (token_type = 6) is stored in $I table (e.g. this token is "站起"). Then we search 站起%, the search result is empty. Do you know the reason? Thanks.
> (b) Search toto%. Length of "toto" is 4 (prefix_legth_max is 3). So it will directly search every token (token_type = 0) in $I table.
> Am I understanding right or wrong? Thanks.

Ah, Chinese characters. That's a whole different matter. What lexer are you using to index the Chinese text? The CHINESE_LEXER and CHINESE_VGRAM_LEXER do not require or support wildcards or the PREFIX_INDEX preference. If you're using BASIC_LEXER for Chinese text (which is definitely not recommended) you may find you get some odd results

3

> Am I understanding right or wrong? Thanks.

Yes

> And I want to know that why using substring_index can improves double-truncated wildcard queries like %benz% (I understand substring_index can improve left-truncated wildcard queries, but I want
> to know the process and principle of querying %benz% by Oracle Text. Could you help to explain it, thanks!)

OK. Let's see what we get in the $P table if we index the word 'FOOBARBAZ':

PAT_PART1 PAT_PART2
--------- ------------
          FOOBARBAZ
F          OOBARBAZ
FO          OBARBAZ
FOO          BARBAZ
FOOB          ARBAZ
FOOBA          RBAZ
FOOBAR          BAZ


Now let's say you wanted to search for '%BAR%'. Internally, Oracle Text does this (my index is called FOOINDEX):
SQL> SELECT pat_part1||pat_part2 FROM dr$fooindex$p WHERE pat_part2 LIKE 'BAR%';

PAT_PART1||PAT_PART2
--------------------------------------------------------------------------------
FOOBARBAZ


So now it's found all words with BAR in the middle of the word. It can then look up those words in the $I table to get their location.

Roger


Is this answer out of date? If it is, please let us know via a Comment