Question and Answer

Tom Kyte

Thanks for the question, Mike.

Asked: November 02, 2002 - 3:43 pm UTC

Last updated: December 02, 2004 - 11:40 am UTC

Version: 9.1.0.2

Viewed 1000+ times

You Asked

A developer in my company has recently begun lamenting that Oracle can't search more than one text field at a time. He has also mentioned in a mild, ribbing way, that mySQL has a generic function, built into the database, that allows a user to search for a fragment of text across all fields of a table without needing to specify the specific columns.

As a programmer I can think of many ways to "code" this (e.g. an extra field where all text from the varchar fields is concatenated via an AFTER UPDATE trigger) but none that don't overly complicate every table I create. I have heard that the mySQL version isn't indexed but is reasonably efficient. This feature intrigues me, as many of the searching interfaces into the database aren't coded by my group, and any built-in feature that is implicitly available to the end users could help them (and lower my support costs).

If you had to write such functionality, how might you start? Or, hopefully, is this functionality already available and I just need to do more research? I am a long-time reader of this site and I know you are probably going to say "You should talk to the users and find out specifically what columns and what tables they want to search, build indexes, and call it a day". I can't, unfortunately, as they don't know what table or even what fields the data resides in that they are searching for, and this "generic" search across a table would be very useful for them, and the client is (nearly) always right.

Thanks for any ideas you might throw my way in advance.


and Tom said...

There are quite a few ways to do this.

The first that jumps to mind is a ctxcat index:

http://docs.oracle.com/cd/B10501_01/text.920/a96517/ind.htm#1011701

another is a text index on a function you write (this is how asktom works) that actually indexes any data you want to assemble. See
http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:440419921146
for a simple example. Here on asktom I store the QUESTION you asked in one table. I store the ANSWER I gave in another table. I store the followups you supply in yet a third table (many rows -- one row for each of the followups you guys put and one row for my followup to your followup).

Now, when you search, I basically say:

select id from question_table where contains( indexed_text, 'your search') > 0

there you go -- I just searched 3 tables -- ACROSS many rows -- treating them as if they were one row in some virtual table somewhere. Hmm, does mysql do that?
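
As a rough illustration of that function-based approach (this is not asktom's actual code -- the table, column, and procedure names below are invented, and details such as required grants and which schema must own the procedure vary by version), an Oracle Text USER_DATASTORE preference points the index at a procedure that assembles the document for each row:

begin
    -- tell Oracle Text to call QA_DOC to build each row's document
    ctx_ddl.create_preference( 'qa_ds', 'user_datastore' );
    ctx_ddl.set_attribute( 'qa_ds', 'procedure', 'qa_doc' );
end;
/

create or replace procedure qa_doc( p_rowid in rowid,
                                    p_doc   in out nocopy clob )
as
begin
    -- glue the question, its answer, and every followup into one
    -- virtual document
    for x in ( select q.question_text txt
                 from question_table q
                where q.rowid = p_rowid
               union all
               select a.answer_text
                 from question_table q, answer_table a
                where q.rowid = p_rowid
                  and a.question_id = q.id
               union all
               select f.followup_text
                 from question_table q, followup_table f
                where q.rowid = p_rowid
                  and f.question_id = q.id )
    loop
        if x.txt is not null
        then
            dbms_lob.writeappend( p_doc, length(x.txt), x.txt );
            dbms_lob.writeappend( p_doc, 1, ' ' );
        end if;
    end loop;
end;
/

create index qa_text_idx on question_table( indexed_text )
    indextype is ctxsys.context
    parameters( 'datastore qa_ds' );

With something along those lines in place, the single CONTAINS query above really is searching all three tables at once.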


Also -- put 1,000,000 rows in mysql with 1000 columns. Now -- using any technique mysql has, any at all -- perform a wildcard search on that data and return to me the answer in under 0.5 seconds. Now, do that in Oracle with either of the above options (and you can). Scale that 1 million up and up and up and see what happens. Hmmm


So, I would probably use intermedia here -- glue the fields together -- index that and let them use keyword, fuzzy, soundex, near, ands, ors, etc etc etc etc etc...

or -- it would be trivial to code a stored procedure that returns a ref cursor that dynamically built a where clause that where'ed on each field based on the columns it discovered in the data dictionary. should take about 3 minutes to code that one (and hey -- it did!)

scott@ORA920.US.ORACLE.COM> create or replace procedure query_table( p_tname in varchar2,
  2         p_string in varchar2,
  3         p_result_set out sys_refcursor )
  4  authid current_user
  5  as
  6      l_query long := 'select * from ' || p_tname || ' where 1=0 ';
  7  begin
  8      dbms_application_info.set_client_info( '%' || p_string || '%' );
  9      for x in ( select column_name from user_tab_columns where table_name = upper(p_tname) )
 10      loop
 11          l_query := l_query || ' or ' || x.column_name || ' like sys_context(''userenv'',''client_info'') ';
 12      end loop;
 13      open p_result_set for l_query;
 14  end;
 15  /

Procedure created.

scott@ORA920.US.ORACLE.COM>
scott@ORA920.US.ORACLE.COM> variable x refcursor
scott@ORA920.US.ORACLE.COM> set autoprint on
scott@ORA920.US.ORACLE.COM> exec query_table( 'emp', 'CLERK', :x );

PL/SQL procedure successfully completed.


     EMPNO ENAME      JOB              MGR HIREDATE         SAL       COMM     DEPTNO
---------- ---------- --------- ---------- --------- ---------- ---------- ----------
      7369 SMITH      CLERK           7902 17-DEC-80        800                    20
      7876 ADAMS      CLERK           7788 23-MAY-87       1100                    20
      7900 JAMES      CLERK           7698 03-DEC-81        950                    30
      7934 MILLER     CLERK           7782 23-JAN-82       1300                    10

scott@ORA920.US.ORACLE.COM> exec query_table( 'dept', 'BOSTON', :x );

PL/SQL procedure successfully completed.


    DEPTNO DNAME          LOC
---------- -------------- -------------
        40 OPERATIONS     BOSTON



but that would be about as efficient as mysql could get it -- no indexes, full scan -- brute force every field.


Comments

Follow up

Mike, November 04, 2002 - 1:10 pm UTC

Thank you. This should do the trick. A small follow up: How exactly do you create a CTXCAT index across multiple fields in a table? I have found several examples that index a single text field and a date field, but none that give an example with multiple text fields. If possible I would prefer to offer the developers something written by Oracle as opposed to some function I would write (such as the brute force method). Do you have an example up your sleeve? Thanks again for the first response.



Tom Kyte
November 05, 2002 - 8:40 am UTC

Well, the ctxcat index is mostly about making what they wanted to do more efficient:

...
You use the CATSEARCH operator with a CTXCAT index mainly to improve mixed query performance. You specify your text query condition with text_query and your structured condition with structured_query. Internally, Oracle Text uses a combined b-tree index on text and structured columns to quickly produce results satisfying the query.
.....

so you could mix text and relational predicates together more efficiently. To directly address the original question:

o the procedure does that for any table, showing that if Oracle forgot to add a feature, adding that feature can be very easy in most cases.

o the text index on the function makes it very very efficient, if you have a set of tables you know you want to do this to (and with more than a couple hundred rows -- you will be needing that capability)
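
For the "written by Oracle" route Mike prefers, Oracle Text also ships a MULTI_COLUMN_DATASTORE that concatenates several columns of one table into a single indexed document. A hedged sketch against SCOTT's EMP (the preference and index names here are made up):

begin
    ctx_ddl.create_preference( 'emp_ds', 'multi_column_datastore' );
    ctx_ddl.set_attribute( 'emp_ds', 'columns', 'ename, job' );
end;
/

create index emp_text_idx on emp( ename )
    indextype is ctxsys.context
    parameters( 'datastore emp_ds' );

-- one predicate now searches ENAME and JOB together
select empno, ename, job
  from emp
 where contains( ename, 'CLERK' ) > 0;

Note the index is created on a single column (here ENAME) but, via the datastore, indexes the text of all the listed columns -- updates to the other columns do not mark the row for re-indexing unless you arrange for that yourself.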

Sweet! But it doesn't work across a dblink :-(

steve shiflett, May 09, 2003 - 3:50 pm UTC

Needed this to work across a dblink, the reason being that I didn't have rights to create functions/procedures on that instance. So, I created the function on my own instance:
------------------------
function qt_like_dcab( p_tname  in varchar2,
                       p_string in varchar2
                     ) return util.generic_cursor authid current_user
as
    v_query      long := 'select * from ' || p_tname || '@dcab where 1=0 ';
    v_result_set util.generic_cursor;
begin
    dbms_application_info.set_client_info( '%' || p_string || '%' );
    for x in ( select column_name from user_tab_columns where table_name = upper(p_tname) )
    loop
        v_query := v_query || ' or ' || x.column_name || ' like sys_context(''userenv'',''client_info'') ';
    end loop;
    open v_result_set for v_query;
    return v_result_set;
end;
------------------------
but got the error -

ORA-02070: database DCAB does not support operator SYS_CONTEXT in this context
ORA-06512: at "XMENORA.QT_LIKE_DCAB", line 13
ORA-06512: at line 1

Tom Kyte
May 09, 2003 - 5:11 pm UTC

use a regular bind variable. you do not need sys_context.
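
For example (a hedged, untested rework of the function above): join to a one-row inline view so the pattern appears as a single bind occurrence -- OPEN ... FOR needs one USING argument per placeholder, and the number of columns isn't known until runtime -- and read the column list from the remote dictionary while you are at it:

function qt_like_dcab( p_tname  in varchar2,
                       p_string in varchar2
                     ) return util.generic_cursor authid current_user
as
    -- the one-row inline view "b" carries the bind, so :pattern occurs
    -- exactly once no matter how many columns get OR'ed together
    v_query      long := 'select t.* from ' || p_tname || '@dcab t, ' ||
                         '( select :pattern p from dual ) b where 1=0 ';
    v_result_set util.generic_cursor;
begin
    -- take the column list from the remote data dictionary
    for x in ( select column_name
                 from user_tab_columns@dcab
                where table_name = upper( p_tname ) )
    loop
        v_query := v_query || ' or t.' || x.column_name || ' like b.p ';
    end loop;
    open v_result_set for v_query using '%' || p_string || '%';
    return v_result_set;
end;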

OK

Ram, January 04, 2004 - 12:40 am UTC

Dear Sir,
Suppose I have a text string like
str := '4 score and 7 years ago'
How can I delete the word 'score' from it?
Also, if I want to insert a word like 'Cruel' into
'Hello world' so that it appears as 'Hello Cruel World',
does Oracle have any built-in packages to do deletes and inserts on a string?
Thanks in advance.
Bye!

Tom Kyte
January 04, 2004 - 9:20 am UTC

replace, instr, substr, translate -- lots of string functions and in 10g, there will be a regular expression suite of functions as well.

OK

Ram, January 05, 2004 - 5:56 am UTC

Dear Sir,
Thanks for your response. Could you please show some
simple examples of inserting and deleting words from a sentence? Sorry for the disturbance.
Thanks in advance.
Bye!

Tom Kyte
January 05, 2004 - 7:34 am UTC

ops$tkyte@ORA920PC> select replace( 'Hello there world', 'there', '' ) from dual;
 
REPLACE('HEL
------------
Hello  world
 


this is pretty basic stuff -- functions like replace, substr, instr exist in virtually all programming languages -- use the same techniques you would in any language. 
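
"Inserting" a word is just string splicing -- for instance, splitting at the first space with INSTR and SUBSTR (a sketch; the positions are specific to this example):

select substr( s, 1, instr( s, ' ' ) ) || 'Cruel ' ||
       substr( s, instr( s, ' ' ) + 1 ) new_str
  from ( select 'Hello World' s from dual );

NEW_STR
-----------------
Hello Cruel World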

Excellent!

Phil, January 05, 2004 - 10:37 am UTC

This is great, Tom, and I think it may be altered to address a problem which I have a few ideas about, but I think (actually I know!) you may well have a better one.

I have been tasked with building an interface from one Oracle instance (actually we may add more in future) to another. It needs to collect data meeting a certain condition from instance A and place a subset of this in an interface table in Instance B periodically. This needs to be automated, and the schedule of each collection needs to be controlled independently. In addition, the data following transformation by the core app (which I am afraid is not Oracle) needs to be updated back to the source systems. I hope this makes enough sense!

One idea I have is as follows: a control table in the interface schema lists the target tables, where clause and update statement for each external source table. A dbms_job then runs which calls a stored procedure to dynamically create a ref cursor on the external table and where clause. I've got that far - I need to open and loop through this query but cannot - the number of columns is always different. I could make it the same as the interface table I guess and pad the query to match that, but it seems a bit awkward.

Following extraction to the interface table, the source is updated so that subsequent calls do not retrieve the data a 2nd time. And after a status change in the interface table, key elements of the data are updated back in the source tables.

I am sure you will have seen this requirement before and guess you may have a more elegant solution. Based on the ref cursor approach in this question, I think it could be used in this case?


Tom Kyte
January 05, 2004 - 1:47 pm UTC

if you can procedurally "build a query", "fetch it", "place into interface table", then you can build a single insert into select from

that is what you should do -- insert into interface_table (columns.... ) select <something>

I don't know what you mean by "padding" -- you would just select whatever columns you wanted?

don't do this procedurally, it's a single insert

SELECT statement only?

A reader, January 05, 2004 - 2:36 pm UTC

OK, that makes sense - I think sometimes a design can overlook a more straightforward approach so thanks for bringing me down to earth!

However, we need to ensure records only come across once. Would it be best to insert into abc(x,y,z) select e,f,g from blah where status=x and then update status=y to control this or open a cursor for update in your opinion?

We intend to use a dbms_job to look for records that need transfer and want to call this as frequently as possible. Is there a limitation on this frequency, as we propose using an interval of 10 seconds? We have a horrible limitation that we cannot alter the source app (i.e. we can't add a trigger on the source table).


Tom Kyte
January 06, 2004 - 7:38 am UTC

do you have to update the records as you pull them? you don't have a timestamp or some other field that would let you easily recognize "new records"?

(you cannot add a trigger, but you can update the tables directly??)

A reader, January 06, 2004 - 7:55 am UTC

Tom, yes there is a timestamp so we could use a control table to hold the last time we pulled and go for records after that. But, we are accessing tables from other instances - and so there is a small danger with frequent polling that the clocks could differ slightly (I think?). A pulled record gets updated in the source system so I think the most straightforward would be a cursor for update. However, this complicates the matter a little as we then are not following your advice of just using an insert. And as we want to make it dynamic to allow extra tables to be easily added this will add to that complexity.
Just going back a step - as this is dynamic, will the performance be less using execute immediate or ref cursors?
Thanks for taking the time to address this for us!

Tom Kyte
January 06, 2004 - 9:04 am UTC

no problem with clocks.

You start with:


select max(timestamp_col) from T@remote;

You have the LAST timestamp remembered from last time -- select that out as well:

select last_timestamp from your_table;
if no data found (first refresh) then last_timestamp = Jan 1st 1970 or something.



Now, you

insert into your_copy
select * from t@remote where timestamp > last_timestamp and timestamp <= max_timestamp;

update or insert into your_table set last_timestamp = max_timestamp;

commit;
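
A minimal PL/SQL sketch of that outline -- assuming a DATE column named TS_COL on the remote table and a local, at-most-one-row control table REFRESH_CONTROL (both names invented for the illustration):

declare
    l_last  date;
    l_max   date;
begin
    -- the remote high-water mark as of right now
    select max(ts_col) into l_max from t@remote;

    -- the mark we remembered last time; default to long ago on first run
    begin
        select last_timestamp into l_last from refresh_control;
    exception
        when no_data_found then
            l_last := to_date( '1970-01-01', 'yyyy-mm-dd' );
    end;

    -- pull only the window (l_last, l_max]
    insert into your_copy
    select *
      from t@remote
     where ts_col >  l_last
       and ts_col <= l_max;

    -- remember the new high-water mark
    update refresh_control set last_timestamp = l_max;
    if sql%rowcount = 0
    then
        insert into refresh_control( last_timestamp ) values ( l_max );
    end if;

    commit;
end;
/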





Thanks a million!

A reader, January 06, 2004 - 9:46 am UTC


Using TIMESTAMP to pick up changes

Gary, January 06, 2004 - 5:34 pm UTC

We've faced an issue in a similar situation where entries can be missed in this process.

Consider:
Session 1 inserts its record at 11:59:00 but doesn't commit until 12:01:00 (eg waiting for another lock or a user to hit commit).
Session 2 inserts its record at 11:59:30 and commits immediately.

Your job comes along at 12:00, picks up the session 2 change but cannot see the uncommitted Session 1 change.
Next run comes along, sees its last timestamp as 11:59:30, and so ignores the Session 1 change from 30 seconds earlier.

Having a flag to update would not encounter this issue.


Tom Kyte
January 06, 2004 - 8:40 pm UTC

ahh -- very very VERY good point.



if the other guy comes back -- can we use a snapshot log?

Good point

A reader, January 07, 2004 - 5:18 am UTC

Yes, this is a potential problem, but I was working on the assumption that the commit was the time the lastupdated field is populated. If not, maybe a snapshot would be better, but this would make it more complex and have an effect on performance.

The goal is to be able to react within seconds of a change, and this is without modifying the source app at all (no triggers, I'm afraid). I was going to use a dbms_job and this lastupdated field, but can use a status too, which may resolve it.

Tom, what is the max frequency of a dbms_job? I have a simple test that inserts a row every 3 seconds and it seems to run less frequently than that, inserting two at a time.


Tom Kyte
January 07, 2004 - 8:13 am UTC

it is about 10 seconds in the latest software.


are these 9ir2 databases? there is another option, streams.

dbms_job minimum interval 8i

Phil, January 07, 2004 - 10:07 am UTC

No, I'm afraid they are not the latest; the instances will be on 8.1.7. Is this less frequent?

Tom Kyte
January 07, 2004 - 6:12 pm UTC

no, not less frequent -- just no streams, fewer options.

OK

Roselyn, January 08, 2004 - 7:08 am UTC

Dear sir,
In the string 'How now brown cow', how can we exchange the words
'How' and 'now' so that the string appears as 'now How brown cow'?
Could you please show me the way to do it?
Thanks in advance.
Bye!


Tom Kyte
January 08, 2004 - 1:50 pm UTC

ops$tkyte@ORA920PC> select replace( replace( replace( 'How now brown cow', 'How', '$X$X$' ),
  2  'now', 'How'), '$X$X$', 'now' )
  3    from dual;
 
REPLACE(REPLACE(R
-----------------
now How brown cow
 
 

Aggregate text field searching

Senthil Swaminathan, February 16, 2004 - 8:50 am UTC


Searching VARCHAR2 column and NUMBER column

Praveen, November 23, 2004 - 6:54 am UTC

Hi Tom,

Is there any difference in performance between a number column and a varchar2 column (same size, approximately) while querying, indexed or unindexed?

I had benchmarked myself and found no difference! Am I correct?

Thanks and regards

Praveen

Tom Kyte
November 23, 2004 - 7:47 am UTC

numbers are physically stored in a very "string like fashion" with 2 digits per byte. they are very much like strings in that regard


at the end of the day ALWAYS store

dates in dates
numbers in numbers
strings in strings

and never ever never never mix them. never put a number in a string, a date in a number and so on.


they are the same comparison-wise, since it is just bytes being compared to other bytes.
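
If you want to see those bytes for yourself, DUMP shows the internal type and encoding (a quick sketch; the exact bytes depend on the value and, for strings, on the character set):

select dump( 123 )   number_dump,
       dump( '123' ) string_dump
  from dual;

On an ASCII-based database that shows something like Typ=2 Len=3: 194,2,24 for the number (an exponent byte plus one byte per pair of digits) and Typ=96 Len=3: 49,50,51 for the string (one byte per character).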

Numeric vs Varchar performance in an indexed column

Praveen, December 01, 2004 - 7:18 am UTC

Sorry, Tom, for bugging you with the same problem I asked about earlier (above). But
the issue came up again in my team a while ago, and there are members who are in
favour of storing the varchar field as numeric (if possible). They went even
to the bit-level details of the internal representation of data to argue that storing
the column as a numeric field rather than a varchar field (after a one-time conversion)
will take up less space than the varchar counterpart, and hence the byte-wise
comparison will be faster during an index search. The incoming data is of
the format '9' .. '99999999' (varchar). This will be converted into 9....99999999
(numeric) before storing. This field is used only for querying; no other processing
of any kind happens on this field at all.

Example: '123' will take 3 bytes
123 will take only 2 bytes.

'123456789' will take 9 bytes
123456789 will take only 5 bytes.

Obviously there are performance differences, right?

Tom Kyte
December 01, 2004 - 9:53 am UTC

look -- if the data is a NUMBER, use a NUMBER.
if the data is DATE use a DATE
if the data is a string use a string.


that is all. use the correct datatype. If all of the values are NUMBERS and all of the values MUST BE NUMBERS and you are currently using a varchar2 -- you have done it wrong -- not because of "speed" or "size" but because you are storing a number in a string.


However, before they do that, ask

is 00012 a possible value? Because if it is, it will become 12.


this is not about speed
this is not about size
this is about using the proper datatype based on what the data "is". If this is a number, must be a number -- and you are using a varchar2, you've done it wrong (you'll let non-numbers slip in there).
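
A quick demonstration of that trap:

select to_number( '00012' ) n from dual;

         N
----------
        12

The leading zeros are gone, and no conversion back will recover them -- exactly the sort of information you silently destroy by storing strings in numbers.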



Ref: Numeric vs Varchar performance in an indexed column

Praveen, December 01, 2004 - 9:35 pm UTC

Hi Tom,

Everyone agrees with your suggestion of identifying the correct datatype at the analysis stage itself. But this is the problem:

This is to allocate a unique serial number to each product (gas turbines) the client manufactures, in order to keep their complete future operational history in a data warehouse. The size of certain tables which are frequently queried can go up to 10 million records, and those queries are mostly based on the serial number. The current format is alphanumeric but consists only of numerals (we thought alphabetic characters may come in the future; it will not start with '0'). So these are the considerations-

1) Compromise on the readability of serial numbers (with an alphanumeric serial number it is easier to understand which category a turbine falls into) for speed and size.

OR

2) If we are not gaining much in the speed (or size) keep alphanumeric serial numbers.

The suggestion from the client is to keep the alphanumeric format, but it is not a hard requirement.

From your above answer, I should guess that the client is right. Should I? But what still worries me is the calculation given in the previous question.

Could you please give me a more detailed answer?

Thanks for your time and effort,

regards

Praveen

Tom Kyte
December 01, 2004 - 9:53 pm UTC

Ok (you could do this yourself too -- it is pretty straightforward...)

ops$tkyte@ORA9IR2> drop table t;
 
Table dropped.
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> create table t
  2  as
  3  select id, to_char(id) idc
  4    from big_table.big_table;
 
Table created.
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> create index t_idx1 on t(id);
 
Index created.
 
ops$tkyte@ORA9IR2> create index t_idx2 on t(idc);
 
Index created.
 
ops$tkyte@ORA9IR2>
ops$tkyte@ORA9IR2> @trace
ops$tkyte@ORA9IR2> alter session set events '10046 trace name context forever, level 12';
 
Session altered.
 
ops$tkyte@ORA9IR2> declare
  2          l_rec t%rowtype;
  3  begin
  4          for x in ( select * from t )
  5          loop
  6                  select * into l_rec from t where id = x.id;
  7                  select * into l_rec from t where idc = x.idc;
  8          end loop;
  9  end;
 10  /
 
PL/SQL procedure successfully completed.


so, my big table has about 1,000,000 rows (well, exactly) and ID was the primary key 1..1000000


SELECT *
from
 t where id = :b1
                                                                                                      
                                                                                                      
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute 1000000    140.83     145.34          0          0          0           0
Fetch   1000000     29.72      29.91       2230    4002225          0     1000000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   2000001    170.55     175.26       2230    4002225          0     1000000
                                                                                                      
Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 170     (recursive depth: 1)
                                                                                                      
Rows     Row Source Operation
-------  ---------------------------------------------------
1000000  TABLE ACCESS BY INDEX ROWID T (cr=4002225 r=2230 w=0 time=25662482 us)
1000000   INDEX RANGE SCAN T_IDX1 (cr=3002225 r=2230 w=0 time=15354098 us)(object id 34948)
                                                                                                      
                                                                                                      
Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  db file sequential read                      2230        0.03          0.22
********************************************************************************
                                                                                                      
SELECT *
from
 t where idc = :b1
                                                                                                      
                                                                                                      
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute 1000000    139.90     142.37          0          0          0           0
Fetch   1000000     29.21      29.32       2499    4002493          0     1000000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   2000001    169.11     171.70       2499    4002493          0     1000000
                                                                                                      
Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 170     (recursive depth: 1)
                                                                                                      
Rows     Row Source Operation
-------  ---------------------------------------------------
1000000  TABLE ACCESS BY INDEX ROWID T (cr=4002493 r=2499 w=0 time=24230621 us)
1000000   INDEX RANGE SCAN T_IDX2 (cr=3002493 r=2499 w=0 time=14647736 us)(object id 34949)

so, that effectively blows the performance argument right out the window...


And let's see: 10,000,000 rows.

Let's double that (for the index keys)...

20,000,000 things.


Let's take that up by an order of magnitude -- just to be safe...

200,000,000 things.

times 4 bytes.

800,000,000 bytes.

I have a pendrive (usb thingy to stick in pc's) that holds almost that much.  I fail to see any "make or break the bank" sort of 'space savings' either.


make your customer happy, keep it alphanumeric

 

Ref: Numeric vs Varchar performance in an indexed column

Praveen, December 02, 2004 - 11:40 am UTC

Hi Tom,

I got your point. Always benchmark before jumping to conclusions. I have done that myself, simulating your example with a different, more realistic set of data.

And the customer is happy... in fact very, very happy with the reasons and your benchmark results (and me too ;)).

Thank you, Tom

with warm regards,

Praveen
