Skip to Main Content
  • Questions
  • Oracle Text Synonym and Theasurus Questions

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, Pawan.

Asked: August 14, 2006 - 4:57 pm UTC

Last updated: August 28, 2006 - 1:06 pm UTC

Version: 10.2.0.2

Viewed 1000+ times

You Asked

Hi, we are planning to use Oracle Text based search for searching contacts within our 10g database. We would like to leverage CATCTX indexes as data exisits in a structured fashion within multiple tables. We intend on building a concatenated data store.

I was wondering if you could help with some of the catsearch queries we are trying to build. We are trying to define the right thesaurus and synonym constructs to model some of the scenarios below.

1) Title based search: We have titles in our database such as:
"vp of eng."
"vp of engg"
"v.p. of engineering"
"vice president, eng"
"vice pres. Engineering"
"vice president of development and engineering"
"vp of marketing"
"software engineer"

So when a user wants to search by title " vp of engineering", the search should return all variations of "vp of engineering" but should not return software engineer or vp of marketing.

2) Gender based search:
Data: Raymond Lane
Raylene Jones

Search:
"Ray" - should return Raymond Lane and Raylene Jones
"Raymond" - should only return Raymond Lane
"Raylene" - should only return Raylene Jones

Any help will be greatly appreciated.

Thanks.
-Pawan


and Tom said...

I asked Omar Alonso (product manager for Oracle Text) for some assistance on this one and he said:


Question #1

This is a good search quality question. A great benefit of Oracle Text is
that the API is powerful enough so you can attack the problem from different
angles. This problem looks like a good candidate for a new technique
introduced in 10g called progressive relaxation.

Progressive relaxation is a technique that uses query variation to maximize
precision while guaranteeing recall. Think of the user-entered query as
simply a sequence of words. You can put these words together in a variety of
ways to form text queries, some more precise than others. You run the variants
sequentially, from most precise to least precise, until you get the number of
hits you need.

The great advantage of progressive relaxation is that you can use it as
your search quality template and tune it until you like the desired output. So
we can split the problem in two steps. First is to came up with a good
relaxation and then build a query parser that will excersie the sequence of
relaxations. I present the first step since the query parser should be
easy to build on.

-- Create table

create table db_titles (titles varchar(50));

insert into db_titles values('vp of eng.');
insert into db_titles values('vp of engg');
insert into db_titles values('v.p. of engineering');
insert into db_titles values('vice president, eng');
insert into db_titles values('vice pres. Engineering');
insert into db_titles values('vice president of development and engineering');
insert into db_titles values('vp of marketing');
insert into db_titles values('software engineer');

commit;

-- Since the data can have a any number of characters as part of the terms,
-- it is useful to define how you plan to normalize the data.
-- For example, you can have "V.P.", "VP." or "VP" and you would like to
-- treat them the same: "VP". The Oracle Text lexer preferences has all the
-- different parameters for the lexer

begin
ctx_ddl.create_preference('mylexp','basic_lexer');
ctx_ddl.set_attribute('mylexp','skipjoins','.');
end;

-- Create Text "context" index

create index title_idx on db_titles(titles)
indextype is ctxsys.context parameters('lexer mylexp');

-- Progressive relaxation sequence:
--
-- 1: "vp" and stems "engineering". A stem operation returns the root of
-- a term. In case of "engineering" -> engineer,engineers.
-- 2: "vice" and equivalence of "pres" and "president"
-- follows by equivalence of "eng" and "engineer"
-- The operator equivalence specifies an acceptable substitution for a
-- word in a query.
-- 3: "vice" stems "president" stems "engineer"; using near in a 6 word clump
-- Near return a score based on the proximity of two or more query terms.
-- A clump is the smallest group of words in which all query terms occur.
-- 4: vp near (equivalence of eng and engg)
-- Note that in #3 and #4 you can do mix different operators

select score(1),titles from db_titles where contains(titles,
'<query><textquery><progression>
<seq>(vp of $(engineering))</seq>
<seq>(vice (pres equiv president) ( eng equiv engineering))</seq>
<seq>(near((vice,$(president),$(engineer)),6,TRUE))</seq>
<seq>(near((vp,eng=engg),3,TRUE))</seq>
</progression></textquery></query>',1)>0 order by score(1) desc;

SQL> select score(1),titles from db_titles where contains(titles,
2 '<query><textquery><progression>
3 <seq>(vp of $(engineering))</seq>
4 <seq>(vice (pres equiv president) ( eng equiv engineering))</seq>
5 <seq>(near((vice,$(president),$(engineer)),6,TRUE))</seq>
6 <seq>(near((vp,eng=engg),3,TRUE))</seq>
7 </progression></textquery></query>',1)>0 order by score(1) desc;

SCORE(1) TITLES
---------- --------------------------------------------------
76 v.p. of engineering
51 vice president, eng
51 vice pres. Engineering
31 vice president of development and engineering
5 vp of eng.
5 vp of engg

6 rows selected.

-- Now with CTXCAT since you can use the progressive relaxation with both
-- indextypes.


create table db_titles2 (titles varchar(50));

insert into db_titles2 values('vp of eng.');
insert into db_titles2 values('vp of engg');
insert into db_titles2 values('v.p. of engineering');
insert into db_titles2 values('vice president, eng');
insert into db_titles2 values('vice pres. Engineering');
insert into db_titles2 values('vice president of development and engineering');
insert into db_titles2 values('vp of marketing');
insert into db_titles2 values('software engineer');

commit;

create index title_idx2 on db_titles2(titles)
indextype is ctxsys.ctxcat parameters('lexer mylexp');

select titles from db_titles2 where catsearch(titles,
'<query><textquery grammar="context"><progression>
<seq>(vp of $(engineering))</seq>
<seq>(vice (pres equiv president) ( eng equiv engineering))</seq>
<seq>(near((vice,$(president),$(engineer)),6,TRUE))</seq>
<seq>(near((vp,eng=engg),3,TRUE))</seq>
</progression></textquery></query>','')>0;

SQL> select titles from db_titles2 where catsearch(titles,
2 '<query><textquery grammar="context"><progression>
3 <seq>(vp of $(engineering))</seq>
4 <seq>(vice (pres equiv president) ( eng equiv engineering))</seq>
5 <seq>(near((vice,$(president),$(engineer)),6,TRUE))</seq>
6 <seq>(near((vp,eng=engg),3,TRUE))</seq>
7 </progression></textquery></query>','')>0;

TITLES
--------------------------------------------------
v.p. of engineering
vice president, eng
vice pres. Engineering
vice president of development and engineering
vp of eng.
vp of engg

6 rows selected.

To summarize, progressive relaxation is a very useful tool for controlling
the relevancy of the search results. You can also restrict the number of rows
returned and also apply other transformations. Roger Ford has a great
tutorial on the subject:
</code> http://www.oracle.com/technology/products/text/htdocs/prog_relax.html <code>


Question #2

that is something we'll be able to do in the near future (beyond 10gR2), but not right now.


Rating

  (2 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Oracle thesaurus

vkeshav, June 27, 2007 - 3:24 pm UTC

On the same lines we are planning to use the oracle thesaurus to create a list of words and use it with the fuzzy search method.

For example when the input string contains INC. it will be converted to INCORPORATED, or Com. would be Company.
I know that the Oracle thesaurus (10g suite) supports this.

My question is can I specify conditions on which the inc or com is changed?
For example if my received string is a proper noun and should be kept as com and not changed to company

Any advise would be appreciated
Thanks

Query Relaxation don't work fine

Nacho, February 10, 2011 - 5:45 am UTC

hello,
i'm using oracle text and i see the new technique for text searching like query relaxation.
But, this technique seems don`t work when the first query sequence don`t returs rows. 
Example:
The language is SPANISH and the oracle version is 10.2.0.1
SQL> create table t (c varchar2(100));

Tabla creada.

SQL> insert into t values ('MI NUEVA COMPAÑIA ESPAÑOLA DE SEGUROS');

1 fila creada.

SQL> CREATE INDEX CTX_IX ON t(c) INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS('SYNC(ON COMMIT)');

Índice creado.

SQL> SELECT c
  2  FROM t 
  3  WHERE CONTAINS (c, '
  4                      <query>
  5                         <textquery>
  6                           <progression>
  7                             <seq>compañia seguros</seq> 
  8                             <seq>compañia and seguros</seq>
  9                           </progression>
 10                         </textquery>
 11                      </query>') > 0 ;

ninguna fila seleccionada

However
SQL> SELECT c
  2  FROM t 
  3  WHERE CONTAINS (c, '
  4                      <query>
  5                         <textquery>
  6                           <progression>
  7                             <seq>compañia and seguros</seq>
  8                           </progression>
  9                         </textquery>
 10                      </query>') > 0 ;

C
--------------------------------------------------------------------------------
MI NUEVA COMPAÑIA ESPAÑOLA DE SEGUROS

When the first query within the sequence query no return rows, then the follow querys don't execute.

Is this an error?