Home>Question Details



Pawan -- Thanks for the question regarding "Oracle Text Synonym and Theasurus Questions ", version 10.2.0.2

Submitted on 14-Aug-2006 16:57 Central time zone
Last updated 28-Aug-2006 13:06

You Asked

Hi, we are planning to use Oracle Text based search for searching contacts within our 10g 
database. We would like to leverage CATCTX indexes as data exisits in a structured 
fashion within multiple tables. We intend on building a concatenated data store. 

I was wondering if you could help with some of the catsearch queries we are trying to 
build. We are trying to define the right thesaurus and synonym constructs to model some 
of the scenarios below. 

1) Title based search: We have titles in our database such as:
"vp of eng."
"vp of engg"
"v.p. of engineering" 
"vice president, eng"
"vice pres. Engineering"
"vice president of development and engineering"
"vp of marketing"
"software engineer"

So when a user wants to search by title " vp of engineering", the search should return 
all variations of "vp of engineering" but should not return software engineer or vp of 
marketing. 

2) Gender based search:
Data: Raymond Lane
Raylene Jones

Search:
"Ray" - should return Raymond Lane and Raylene Jones
"Raymond" - should only return Raymond Lane
"Raylene" - should only return Raylene Jones

Any help will be greatly appreciated.

Thanks.
-Pawan 
 

and we said...

I asked Omar Alonso (product manager for Oracle Text) for some assistance on this one and 
he said:


Question #1

This is a good search quality question. A great benefit of Oracle Text is
that the API is powerful enough so you can attack the problem from different
angles. This problem looks like a good candidate for a new technique 
introduced in 10g called progressive relaxation.

Progressive relaxation is a technique that uses query variation to maximize 
precision while guaranteeing recall.  Think of the user-entered query as 
simply a sequence of words.  You can put these words together in a variety of 
ways to form text queries, some more precise than others.  You run the variants
sequentially, from most precise to least precise, until you get the number of 
hits you need. 

The great advantage of progressive relaxation is that you can use it as
your search quality template and tune it until you like the desired output. So
we can split the problem in two steps. First is to came up with a good 
relaxation and then build a query parser that will excersie the sequence of
relaxations. I present the first step since the query parser should be
easy to build on. 

-- Create table

create table db_titles (titles varchar(50));

insert into db_titles values('vp of eng.');
insert into db_titles values('vp of engg');
insert into db_titles values('v.p. of engineering'); 
insert into db_titles values('vice president, eng');
insert into db_titles values('vice pres. Engineering');
insert into db_titles values('vice president of development and engineering');
insert into db_titles values('vp of marketing');
insert into db_titles values('software engineer');

commit;

-- Since the data can have a any number of characters as part of the terms,
-- it is useful to define how you plan to normalize the data.
-- For example, you can have "V.P.", "VP." or "VP" and you would like to 
-- treat them the same: "VP". The Oracle Text lexer preferences has all the
-- different parameters for the lexer

begin
 ctx_ddl.create_preference('mylexp','basic_lexer');
 ctx_ddl.set_attribute('mylexp','skipjoins','.');
end;

-- Create Text "context" index

create index title_idx on db_titles(titles)
indextype is ctxsys.context parameters('lexer mylexp');

-- Progressive relaxation sequence:
--
--  1: "vp" and stems "engineering". A stem operation returns the root of 
--     a term. In case of "engineering" -> engineer,engineers.
--  2: "vice" and equivalence of "pres" and "president" 
--              follows by equivalence of "eng" and "engineer"
--     The operator equivalence specifies an acceptable substitution for a 
--     word in a query.
--  3: "vice" stems "president" stems "engineer"; using near in a 6 word clump
--     Near return a score based on the proximity of two or more query terms.
--     A clump is the smallest group of words in which all query terms occur.
--  4: vp near (equivalence of eng and engg)
--  Note that in #3 and #4 you can do mix different operators

select score(1),titles from db_titles where contains(titles,
   '<query><textquery><progression>
   <seq>(vp of $(engineering))</seq>
   <seq>(vice (pres equiv president) ( eng equiv engineering))</seq>
   <seq>(near((vice,$(president),$(engineer)),6,TRUE))</seq>
   <seq>(near((vp,eng=engg),3,TRUE))</seq>
   </progression></textquery></query>',1)>0 order by score(1) desc;

SQL> select score(1),titles from db_titles where contains(titles,
  2     '<query><textquery><progression>
  3     <seq>(vp of $(engineering))</seq>
  4     <seq>(vice (pres equiv president) ( eng equiv engineering))</seq>
  5     <seq>(near((vice,$(president),$(engineer)),6,TRUE))</seq>
  6     <seq>(near((vp,eng=engg),3,TRUE))</seq>
  7     </progression></textquery></query>',1)>0 order by score(1) desc;

  SCORE(1) TITLES
---------- --------------------------------------------------
        76 v.p. of engineering
        51 vice president, eng
        51 vice pres. Engineering
        31 vice president of development and engineering
         5 vp of eng.
         5 vp of engg

6 rows selected.

-- Now with CTXCAT since you can use the progressive relaxation with both
-- indextypes.


create table db_titles2 (titles varchar(50));

insert into db_titles2 values('vp of eng.');
insert into db_titles2 values('vp of engg');
insert into db_titles2 values('v.p. of engineering'); 
insert into db_titles2 values('vice president, eng');
insert into db_titles2 values('vice pres. Engineering');
insert into db_titles2 values('vice president of development and engineering');
insert into db_titles2 values('vp of marketing');
insert into db_titles2 values('software engineer');

commit;

create index title_idx2 on db_titles2(titles)
indextype is ctxsys.ctxcat parameters('lexer mylexp');

select titles from db_titles2 where catsearch(titles,
   '<query><textquery grammar="context"><progression>
   <seq>(vp of $(engineering))</seq>
   <seq>(vice (pres equiv president) ( eng equiv engineering))</seq>
   <seq>(near((vice,$(president),$(engineer)),6,TRUE))</seq>
   <seq>(near((vp,eng=engg),3,TRUE))</seq>
   </progression></textquery></query>','')>0;

SQL> select titles from db_titles2 where catsearch(titles,
  2     '<query><textquery grammar="context"><progression>
  3     <seq>(vp of $(engineering))</seq>
  4     <seq>(vice (pres equiv president) ( eng equiv engineering))</seq>
  5     <seq>(near((vice,$(president),$(engineer)),6,TRUE))</seq>
  6     <seq>(near((vp,eng=engg),3,TRUE))</seq>
  7     </progression></textquery></query>','')>0;

TITLES
--------------------------------------------------
v.p. of engineering
vice president, eng
vice pres. Engineering
vice president of development and engineering
vp of eng.
vp of engg

6 rows selected.

To summarize, progressive relaxation is a very useful tool for controlling
the relevancy of the search results. You can also restrict the number of rows
returned and also apply other transformations. Roger Ford has a great
tutorial on the subject:
http://www.oracle.com/technology/products/text/htdocs/prog_relax.html


Question #2

that is something we'll be able to do in the near future (beyond 10gR2), but not right 
now.
 

Reviews    
3 stars Oracle thesaurus   June 27, 2007 - 3pm Central time zone
Reviewer: vkeshav from NY,USA
On the same lines we are planning to use the oracle thesaurus to create a list of words and use it with the fuzzy search method.

For example when the input string contains INC. it will be converted to INCORPORATED, or Com. would be Company.
I know that the Oracle thesaurus (10g suite) supports this.

My question is can I specify conditions on which the inc or com is changed?
For example if my received string is a proper noun and should be kept as com and not changed to company

Any advise would be appreciated
Thanks


Write a Review
 


All information and materials provided here are provided "as-is"; Oracle disclaims all express and implied warranties, including, the implied warranties of merchantability or fitness for a particular use. Oracle shall not be liable for any damages, including, direct, indirect, incidental, special or consequential damages for loss of profits, revenue, data or data use, incurred by you or any third party in connection with the use of this information or these materials.

About Oracle | Legal Notices and Terms of Use | Privacy Statement