Skip to Main Content

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, Srimal.

Asked: September 14, 2001 - 4:26 pm UTC

Last updated: June 11, 2008 - 9:26 pm UTC

Version: 8.0.6

Viewed 1000+ times

You Asked

Hello Mr Tom,

I am trying to use Soundex search along with getting a particular number or rows and sort it at the same time.... Could you give me an idea as to how exactly a soundex search works... I know that using a soundex search does a full table scan and if I introduce group by, then the process is further slows down.

Could you possibly guide me as to what are the important aspects that I need to keep in mind when using a 'Soundex' .


I am also sending you a copy of the code that I am looking into..

There are two table A and B.. A is the Parent table and B is the child table.. with col1 joining A and B

select * from (select rownum r, t.* from (select A.col1, A.col2, B.col3 from A, B where a.col1 = b.col1 and soundex(a.col2) = soundex('Smith')) t where rownum < 25 ) where r > 1

I see a significant decrease in speed if I introduce a group by function (alternate to order by because of 8.0.6)

select * from (select rownum r, t.* from (select A.col1, A.col2, B.col3 from A, B where a.col1 = b.col1 and soundex(a.col2) = soundex('Smith') group by col1, col2, col3) t where rownum < 25 ) where r > 1


Thank you in advance
Srimal

and Tom said...

Well, in Oracle8i, you can index a function (so you could index the soundex of a column). See
</code> http://asktom.oracle.com/~tkyte/article1/index.html <code>
for details.

In Oracle8.0 and before, you might consider adding another column -- a "soundex" column and using a trigger before insert or update for each row to maintain this column. Then you can index it and a search like:

select * from t where soundex_column = soundex( 'some word' ) order by y;

an index on (soundex_column,y) would be able to use the index to find the answser and sort it.


Yes, you will see a decrease in the second query. It must return ALL rows, group them and then return the first 25. The first query just gets 25 rows.

Rating

  (12 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Thank you

Srimal, September 14, 2001 - 4:50 pm UTC

Thank you for a prompt reply... The solution that you presented answered a lot of my questions...



Difference in words

umesh, June 20, 2003 - 4:46 am UTC

Tom,
scott>select ename from emp ;

ENAME
-------
SMITH
ALLEN
WARD
JONES
MARTIN BLAKE CLARK SCOTT TURNER ADAMS JAMES


scott>select ename from temp ;

ENAME
----------SMITH ALLEN WARD JONES MARTIN BLAKE CLARK SCOTT TURNED ADAMS JAMES1

I have 2 columns in my D.B. I need to pick up values from emp table and temp table which differ in 1 letter or say 2 letters or say 3 letters ( according to my needs)

I start with JAMES I need a query which gives me JAMES and JAMES1 because there is a difference of 1 letter .

Adding to the complexity the 2 words of comparision may not sound the same

Thanks for your patience and good work






difference in words

umesh, June 20, 2003 - 4:52 am UTC

scott>select ename from emp;

ENAME
----------
SMITH
ALLEN
WARD
JONES
MARTIN
BLAKE
CLARK
SCOTT
à
TURNER
ADAMS
JAMES
à
Verri&#352;re


15 rows selected.

scott>select ename from temp;

ENAME
----------
SMITH
ALLEN
WARD
JONES
MARTIN
BLAKE
CLARK
SCOTT
à
TURNED
ADAMS
JAMES1
à
Verri&#352;re


15 rows selected.

I have 2 columns in my D.B. I need to pick up values from emp table and temp
table which differ in 1 letter or say 2 letters or say 3 letters ( according to
my needs)

I start with JAMES I need a query which gives me JAMES and JAMES1 because there
is a difference of 1 letter .

Adding to the complexity the 2 words of comparision may not sound the same

Thanks for your patience and good work


Tom Kyte
June 20, 2003 - 4:57 pm UTC

insufficient detail to make a sensible answer here ( and it is really a new question, not a followup to the orig question but...)


if you just want to know what words are in table1/table2 and not in the other -- MINUS or FULL OUTER JOINS can do that


select * from t1 minus select * from t2 <<<=== everything in t1 not in t2
select * from t2 minus select * from t1 <<<=== everything in t2 not in t1

Now, YOU have to figure out how the heck you match JAMES up with JAMES1 from those two subresults -- you'll need some sort of pattern match I suppose

use Levenshtein distance formula to find difference in words (for Umesh)

Barbara Boehmer, June 20, 2003 - 7:44 pm UTC

Tom,

How about using a function to calculate the Levenshtein distance between two strings to solve Umesh's problem of finding the difference in words.  Here is a link for further information about the Levenshtein distance formula:

http://www.merriampark.com/ld.htm

I have included a usage example below, similar to Umesh's problem.  Please let me know what you think.

Thanks,
Barbara


SQL> -- test data:
SQL> SELECT ename FROM emp2
  2  /

ENAME                                                                           
----------                                                                      
TURNER                                                                          
ADAMS                                                                           
JAMES                                                                           
FORD                                                                            
MILLER                                                                          

SQL> SELECT ename FROM temp
  2  /

ENAME                                                                           
----------                                                                      
TURNED                                                                          
ADAMS                                                                           
JAMES1                                                                          

SQL> -- function for calculating Levenshtein distance between two strings
SQL> -- (the number of characters that need to be changed, or alternatively
SQL> --  added or removed, to convert one string to the other):
SQL> CREATE OR REPLACE FUNCTION ld -- Levenshtein distance
  2    (p_source_string   IN VARCHAR2,
  3      p_target_string   IN VARCHAR2)
  4    RETURN             NUMBER
  5    DETERMINISTIC
  6  AS
  7    v_length_of_source    NUMBER := NVL (LENGTH (p_source_string), 0);
  8    v_length_of_target    NUMBER := NVL (LENGTH (p_target_string), 0);
  9    TYPE mytabtype IS     TABLE OF NUMBER INDEX BY BINARY_INTEGER;
 10    column_to_left         mytabtype;
 11    current_column         mytabtype;
 12    v_cost             NUMBER := 0;
 13  BEGIN
 14    IF v_length_of_source = 0 THEN
 15       RETURN v_length_of_target;
 16    ELSIF v_length_of_target = 0 THEN
 17       RETURN v_length_of_source;
 18    ELSE
 19       FOR j IN 0 .. v_length_of_target LOOP
 20         column_to_left(j) := j;
 21       END LOOP;
 22       FOR i IN 1.. v_length_of_source LOOP
 23         current_column(0) := i;
 24         FOR j IN 1 .. v_length_of_target LOOP
 25           IF SUBSTR (p_source_string, i, 1) =
 26          SUBSTR (p_target_string, j, 1)
 27           THEN v_cost := 0;
 28           ELSE v_cost := 1;
 29           END IF;
 30           current_column(j) := LEAST (current_column(j-1) + 1,
 31                       column_to_left(j) + 1,
 32                       column_to_left(j-1) + v_cost);
 33         END LOOP;
 34         FOR j IN 0 .. v_length_of_target  LOOP
 35           column_to_left(j) := current_column(j);
 36         END LOOP;
 37       END LOOP;
 38    END IF;
 39    RETURN current_column(v_length_of_target);
 40  END ld;
 41  /

Function created.

SQL> SHOW ERRORS
No errors.
SQL> -- select Levenshtein distance between ename values in emp2 and temp
SQL> -- (intentional cartesian product to compare all values):
SQL> SELECT   emp2.ename, temp.ename,
  2            ld (emp2.ename, temp.ename) AS lev_dist
  3  FROM     emp2, temp
  4  ORDER BY emp2.ename, temp.ename
  5  /

ENAME      ENAME        LEV_DIST                                                
---------- ---------- ----------                                                
ADAMS      ADAMS               0                                                
ADAMS      JAMES1              4                                                
ADAMS      TURNED              6                                                
FORD       ADAMS               5                                                
FORD       JAMES1              6                                                
FORD       TURNED              4                                                
JAMES      ADAMS               3                                                
JAMES      JAMES1              1                                                
JAMES      TURNED              5                                                
MILLER     ADAMS               6                                                
MILLER     JAMES1              6                                                
MILLER     TURNED              5                                                
TURNER     ADAMS               6                                                
TURNER     JAMES1              6                                                
TURNER     TURNED              1                                                

15 rows selected.

SQL> -- limit selection to Levenshtein distance of one or less:
SQL> SELECT   emp2.ename, temp.ename,
  2            ld (emp2.ename, temp.ename) AS lev_dist
  3  FROM     emp2, temp
  4  WHERE    ld (emp2.ename, temp.ename) <= 1
  5  ORDER BY emp2.ename, temp.ename
  6  /

ENAME      ENAME        LEV_DIST                                                
---------- ---------- ----------                                                
ADAMS      ADAMS               0                                                
JAMES      JAMES1              1                                                
TURNER     TURNED              1                                                
 

Tom Kyte
June 21, 2003 - 10:09 am UTC

do that on a realistic sized set of data ;)

gotta a couple of days? I was recently working with someone using that exact function. The cartesian products blow up fast (100 rows = 100x100 = 10,000 compares. 1,000 rows = 1000x1000=1,000,000 compares and so on).

We estimated that it would take a couple of weeks to finish.

So.... we created the cartesian product in a single table

c1 c2 score
--- ---- -------
... ... ...

leaving score "blank', then using the technique from
</code> http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:10498431232211 <code>

we wrote a Pro*C program to bulk fetch 100 rows out, compute the distance between the two and update it back in bulk. we got the distances computed using "parallel 32" in about 4-5 hours that way.





Hats off to you Barbara Boehmer !!!!

A reader, June 20, 2003 - 8:54 pm UTC


response to follow-up

Barbara Boehmer, June 21, 2003 - 5:36 pm UTC

Tom,

Thanks for the feedback. You're right that it does get very slow with large data sets. It sounds like you came up with a nifty solution to speed things up.

I have been using the function for data clean-up of non-validated data from other systems. However, I am looping through a cursor and just comparing one value at a time to the appropriate lookup table. I am also able to limit the number of rows that I have to compare to, based on other filter conditions, and am only using the function as a last resort, when there isn't an exact match or soundex match or match based on other criteria. I also do some general standardization, such as converting everything to upper case and removing double spaces and so forth, prior to any comparisons.

I have a weekly process that takes a total of about ten minutes to load four text files into staging tables via SQL*Loader (we are still using Oracle 8.1.7, so I can't use external tables like in 9i yet), identify about three hundred rows from the parent table out of about twenty thousand, and several hundred rows from the three child tables, clean up the data in many of the columns, then insert them into several Oracle tables on our system. So, it has worked well for me. It beats the heck out of cleaning it up manually or eliminating any data that can't be validated otherwise.

By the way, I should be receiving your book in the mail soon. I procrastinated on buying it until it was too late due to Wrox going out of business. However, a friend recently located a used copy, which I purchased, and should be receiving in the mail by Friday. I read Chapter One on-line. Making that first chapter available on-line is an excellent method of enticing people to purchase the book to read the rest, which I am looking forward to shortly.

Keep up the good work!

Thanks,
Barbara


Thanks to Barbara and Tom

umesh, June 23, 2003 - 12:16 am UTC


Comparing two strings

Kit, January 16, 2004 - 11:07 am UTC

HI Tom.

The package posted by the other user was something I needed but as you said , it could take time on large amounts of data.
do you have any other ideas of comparing two columns varchar and counting on the characters that are diff.

How does the soundex internals work.

Tom Kyte
January 16, 2004 - 11:37 am UTC

well, if you follow the link, i show how i did this exact thing (using pro*c)

soundex "internals" are actually documented. google soundex, the algorithm is out there.

Why is this not true

A reader, April 21, 2005 - 10:33 am UTC

select 1 from dual where soundex('ate')= soundex('eight');

no rows selected


Why is soundex calculating these sounds as different?

Thanks

Tom Kyte
April 22, 2005 - 8:52 am UTC

ops$tkyte@ORA9IR2> select soundex('ate'), soundex('eight') from dual;
 
SOUN SOUN
---- ----
A300 E230


Because the standard soundex function, as dictated by the algorith, always uses the first character of the word.  If the words do not start with the same character, not going to happen with that standard implementation of soundex. 

soundex to oracle text

LC, April 22, 2005 - 10:57 am UTC

Just a quick note on my experiences with soundex and searching...

This worked great for me the with an indexed soundexed_column on the table as described by Tom above. Well, it worked great until my tables grew to so many rows that the returned results were no longer useful. IE... after adding a few hundred thousand PERSON's and doing a soundex search on an obscure name it would return like 1500 rows that were not even close to the searched value. That's just how the soundex algorithm works - potentially lots of false positives on large data sets.

As "enhancing" our search was a federal requirement, I recently moved to Oracle Text using the contains operator with fuzzy matching. I have a slider in my web app that allows the users to control the search precision via fuzzy's parameters. Performance, returned results, and functionality are great.

Now that is still not going to make ate = eight, dick = richard or tom = thomas, but I also created a names thesaurus with common nicknames and alternate spellings for people's first names.

All in all, the oracle text option and already available scoring seem a much better choice for many implementations.

soudnex

A reader, June 11, 2008 - 12:59 pm UTC

Tom:
if soundex search for a magazine title, does it mean we have to create an index on the title column first and have that updated eitehr nightly or using a trigger on idnert/update.
Tom Kyte
June 11, 2008 - 4:05 pm UTC

if you want to use an index to retrieve data, then you would use a create index - yes.

if you wanted:

select * from t where soundex(title) = soundex(:x);

to use an index, you would create index i on t(soundex(title));

And it would be maintained just like any other index - you need to do nothing special.

soundex

A reader, June 11, 2008 - 7:12 pm UTC

Tom:

Why is the soundex index not getting created


SQL>  create index idx_ename on emp(soundex(ename));
 create index idx_ename on emp(soundex(ename))
                                       *
ERROR at line 1:
ORA-01031: insufficient privileges


  1* create index idx_ename on emp(ename)
SQL> /

Index created.

Tom Kyte
June 11, 2008 - 9:26 pm UTC

got version?

in old old releases, you needed more privileges to create function based indexes.

More to Explore

PL/SQL demos

Check out more PL/SQL tutorials on our LiveSQL tool.

PL/SQL docs

PL/SQL reference manual from the Oracle documentation library