Hi, we are planning to use Oracle Text based search for searching contacts within our 10g database. We would like to leverage CATCTX indexes as data exisits in a structured fashion within multiple tables. We intend on building a concatenated data store.
I was wondering if you could help with some of the catsearch queries we are trying to build. We are trying to define the right thesaurus and synonym constructs to model some of the scenarios below.
1) Title based search: We have titles in our database such as:
"vp of eng."
"vp of engg"
"v.p. of engineering"
"vice president, eng"
"vice pres. Engineering"
"vice president of development and engineering"
"vp of marketing"
"software engineer"
So when a user wants to search by title " vp of engineering", the search should return all variations of "vp of engineering" but should not return software engineer or vp of marketing.
2) Gender based search:
Data: Raymond Lane
Raylene Jones
Search:
"Ray" - should return Raymond Lane and Raylene Jones
"Raymond" - should only return Raymond Lane
"Raylene" - should only return Raylene Jones
Any help will be greatly appreciated.
Thanks.
-Pawan
I asked Omar Alonso (product manager for Oracle Text) for some assistance on this one and he said:
Question #1
This is a good search quality question. A great benefit of Oracle Text is
that the API is powerful enough so you can attack the problem from different
angles. This problem looks like a good candidate for a new technique
introduced in 10g called progressive relaxation.
Progressive relaxation is a technique that uses query variation to maximize
precision while guaranteeing recall. Think of the user-entered query as
simply a sequence of words. You can put these words together in a variety of
ways to form text queries, some more precise than others. You run the variants
sequentially, from most precise to least precise, until you get the number of
hits you need.
The great advantage of progressive relaxation is that you can use it as
your search quality template and tune it until you like the desired output. So
we can split the problem in two steps. First is to came up with a good
relaxation and then build a query parser that will excersie the sequence of
relaxations. I present the first step since the query parser should be
easy to build on.
-- Create table
create table db_titles (titles varchar(50));
insert into db_titles values('vp of eng.');
insert into db_titles values('vp of engg');
insert into db_titles values('v.p. of engineering');
insert into db_titles values('vice president, eng');
insert into db_titles values('vice pres. Engineering');
insert into db_titles values('vice president of development and engineering');
insert into db_titles values('vp of marketing');
insert into db_titles values('software engineer');
commit;
-- Since the data can have a any number of characters as part of the terms,
-- it is useful to define how you plan to normalize the data.
-- For example, you can have "V.P.", "VP." or "VP" and you would like to
-- treat them the same: "VP". The Oracle Text lexer preferences has all the
-- different parameters for the lexer
begin
ctx_ddl.create_preference('mylexp','basic_lexer');
ctx_ddl.set_attribute('mylexp','skipjoins','.');
end;
-- Create Text "context" index
create index title_idx on db_titles(titles)
indextype is ctxsys.context parameters('lexer mylexp');
-- Progressive relaxation sequence:
--
-- 1: "vp" and stems "engineering". A stem operation returns the root of
-- a term. In case of "engineering" -> engineer,engineers.
-- 2: "vice" and equivalence of "pres" and "president"
-- follows by equivalence of "eng" and "engineer"
-- The operator equivalence specifies an acceptable substitution for a
-- word in a query.
-- 3: "vice" stems "president" stems "engineer"; using near in a 6 word clump
-- Near return a score based on the proximity of two or more query terms.
-- A clump is the smallest group of words in which all query terms occur.
-- 4: vp near (equivalence of eng and engg)
-- Note that in #3 and #4 you can do mix different operators
select score(1),titles from db_titles where contains(titles,
'<query><textquery><progression>
<seq>(vp of $(engineering))</seq>
<seq>(vice (pres equiv president) ( eng equiv engineering))</seq>
<seq>(near((vice,$(president),$(engineer)),6,TRUE))</seq>
<seq>(near((vp,eng=engg),3,TRUE))</seq>
</progression></textquery></query>',1)>0 order by score(1) desc;
SQL> select score(1),titles from db_titles where contains(titles,
2 '<query><textquery><progression>
3 <seq>(vp of $(engineering))</seq>
4 <seq>(vice (pres equiv president) ( eng equiv engineering))</seq>
5 <seq>(near((vice,$(president),$(engineer)),6,TRUE))</seq>
6 <seq>(near((vp,eng=engg),3,TRUE))</seq>
7 </progression></textquery></query>',1)>0 order by score(1) desc;
SCORE(1) TITLES
---------- --------------------------------------------------
76 v.p. of engineering
51 vice president, eng
51 vice pres. Engineering
31 vice president of development and engineering
5 vp of eng.
5 vp of engg
6 rows selected.
-- Now with CTXCAT since you can use the progressive relaxation with both
-- indextypes.
create table db_titles2 (titles varchar(50));
insert into db_titles2 values('vp of eng.');
insert into db_titles2 values('vp of engg');
insert into db_titles2 values('v.p. of engineering');
insert into db_titles2 values('vice president, eng');
insert into db_titles2 values('vice pres. Engineering');
insert into db_titles2 values('vice president of development and engineering');
insert into db_titles2 values('vp of marketing');
insert into db_titles2 values('software engineer');
commit;
create index title_idx2 on db_titles2(titles)
indextype is ctxsys.ctxcat parameters('lexer mylexp');
select titles from db_titles2 where catsearch(titles,
'<query><textquery grammar="context"><progression>
<seq>(vp of $(engineering))</seq>
<seq>(vice (pres equiv president) ( eng equiv engineering))</seq>
<seq>(near((vice,$(president),$(engineer)),6,TRUE))</seq>
<seq>(near((vp,eng=engg),3,TRUE))</seq>
</progression></textquery></query>','')>0;
SQL> select titles from db_titles2 where catsearch(titles,
2 '<query><textquery grammar="context"><progression>
3 <seq>(vp of $(engineering))</seq>
4 <seq>(vice (pres equiv president) ( eng equiv engineering))</seq>
5 <seq>(near((vice,$(president),$(engineer)),6,TRUE))</seq>
6 <seq>(near((vp,eng=engg),3,TRUE))</seq>
7 </progression></textquery></query>','')>0;
TITLES
--------------------------------------------------
v.p. of engineering
vice president, eng
vice pres. Engineering
vice president of development and engineering
vp of eng.
vp of engg
6 rows selected.
To summarize, progressive relaxation is a very useful tool for controlling
the relevancy of the search results. You can also restrict the number of rows
returned and also apply other transformations. Roger Ford has a great
tutorial on the subject:
</code>
http://www.oracle.com/technology/products/text/htdocs/prog_relax.html <code>
Question #2
that is something we'll be able to do in the near future (beyond 10gR2), but not right now.