Thank you
Srimal, September 14, 2001 - 4:50 pm UTC
Thank you for a prompt reply... The solution that you presented answered a lot of my questions...
Difference in words
umesh, June 20, 2003 - 4:46 am UTC
Tom,
scott>select ename from emp ;
ENAME
-------
SMITH
ALLEN
WARD
JONES
MARTIN BLAKE CLARK SCOTT TURNER ADAMS JAMES
scott>select ename from temp ;
ENAME
----------SMITH ALLEN WARD JONES MARTIN BLAKE CLARK SCOTT TURNED ADAMS JAMES1
I have 2 columns in my D.B. I need to pick up values from emp table and temp table which differ in 1 letter or say 2 letters or say 3 letters ( according to my needs)
I start with JAMES I need a query which gives me JAMES and JAMES1 because there is a difference of 1 letter .
Adding to the complexity the 2 words of comparision may not sound the same
Thanks for your patience and good work
difference in words
umesh, June 20, 2003 - 4:52 am UTC
scott>select ename from emp;
ENAME
----------
SMITH
ALLEN
WARD
JONES
MARTIN
BLAKE
CLARK
SCOTT
à
TURNER
ADAMS
JAMES
à
VerriŠre
15 rows selected.
scott>select ename from temp;
ENAME
----------
SMITH
ALLEN
WARD
JONES
MARTIN
BLAKE
CLARK
SCOTT
à
TURNED
ADAMS
JAMES1
à
VerriŠre
15 rows selected.
I have 2 columns in my D.B. I need to pick up values from emp table and temp
table which differ in 1 letter or say 2 letters or say 3 letters ( according to
my needs)
I start with JAMES I need a query which gives me JAMES and JAMES1 because there
is a difference of 1 letter .
Adding to the complexity the 2 words of comparision may not sound the same
Thanks for your patience and good work
June 20, 2003 - 4:57 pm UTC
insufficient detail to make a sensible answer here ( and it is really a new question, not a followup to the orig question but...)
if you just want to know what words are in table1/table2 and not in the other -- MINUS or FULL OUTER JOINS can do that
select * from t1 minus select * from t2 <<<=== everything in t1 not in t2
select * from t2 minus select * from t1 <<<=== everything in t2 not in t1
Now, YOU have to figure out how the heck you match JAMES up with JAMES1 from those two subresults -- you'll need some sort of pattern match I suppose
use Levenshtein distance formula to find difference in words (for Umesh)
Barbara Boehmer, June 20, 2003 - 7:44 pm UTC
Tom,
How about using a function to calculate the Levenshtein distance between two strings to solve Umesh's problem of finding the difference in words. Here is a link for further information about the Levenshtein distance formula:
http://www.merriampark.com/ld.htm
I have included a usage example below, similar to Umesh's problem. Please let me know what you think.
Thanks,
Barbara
SQL> -- test data:
SQL> SELECT ename FROM emp2
2 /
ENAME
----------
TURNER
ADAMS
JAMES
FORD
MILLER
SQL> SELECT ename FROM temp
2 /
ENAME
----------
TURNED
ADAMS
JAMES1
SQL> -- function for calculating Levenshtein distance between two strings
SQL> -- (the number of characters that need to be changed, or alternatively
SQL> -- added or removed, to convert one string to the other):
SQL> CREATE OR REPLACE FUNCTION ld -- Levenshtein distance
2 (p_source_string IN VARCHAR2,
3 p_target_string IN VARCHAR2)
4 RETURN NUMBER
5 DETERMINISTIC
6 AS
7 v_length_of_source NUMBER := NVL (LENGTH (p_source_string), 0);
8 v_length_of_target NUMBER := NVL (LENGTH (p_target_string), 0);
9 TYPE mytabtype IS TABLE OF NUMBER INDEX BY BINARY_INTEGER;
10 column_to_left mytabtype;
11 current_column mytabtype;
12 v_cost NUMBER := 0;
13 BEGIN
14 IF v_length_of_source = 0 THEN
15 RETURN v_length_of_target;
16 ELSIF v_length_of_target = 0 THEN
17 RETURN v_length_of_source;
18 ELSE
19 FOR j IN 0 .. v_length_of_target LOOP
20 column_to_left(j) := j;
21 END LOOP;
22 FOR i IN 1.. v_length_of_source LOOP
23 current_column(0) := i;
24 FOR j IN 1 .. v_length_of_target LOOP
25 IF SUBSTR (p_source_string, i, 1) =
26 SUBSTR (p_target_string, j, 1)
27 THEN v_cost := 0;
28 ELSE v_cost := 1;
29 END IF;
30 current_column(j) := LEAST (current_column(j-1) + 1,
31 column_to_left(j) + 1,
32 column_to_left(j-1) + v_cost);
33 END LOOP;
34 FOR j IN 0 .. v_length_of_target LOOP
35 column_to_left(j) := current_column(j);
36 END LOOP;
37 END LOOP;
38 END IF;
39 RETURN current_column(v_length_of_target);
40 END ld;
41 /
Function created.
SQL> SHOW ERRORS
No errors.
SQL> -- select Levenshtein distance between ename values in emp2 and temp
SQL> -- (intentional cartesian product to compare all values):
SQL> SELECT emp2.ename, temp.ename,
2 ld (emp2.ename, temp.ename) AS lev_dist
3 FROM emp2, temp
4 ORDER BY emp2.ename, temp.ename
5 /
ENAME ENAME LEV_DIST
---------- ---------- ----------
ADAMS ADAMS 0
ADAMS JAMES1 4
ADAMS TURNED 6
FORD ADAMS 5
FORD JAMES1 6
FORD TURNED 4
JAMES ADAMS 3
JAMES JAMES1 1
JAMES TURNED 5
MILLER ADAMS 6
MILLER JAMES1 6
MILLER TURNED 5
TURNER ADAMS 6
TURNER JAMES1 6
TURNER TURNED 1
15 rows selected.
SQL> -- limit selection to Levenshtein distance of one or less:
SQL> SELECT emp2.ename, temp.ename,
2 ld (emp2.ename, temp.ename) AS lev_dist
3 FROM emp2, temp
4 WHERE ld (emp2.ename, temp.ename) <= 1
5 ORDER BY emp2.ename, temp.ename
6 /
ENAME ENAME LEV_DIST
---------- ---------- ----------
ADAMS ADAMS 0
JAMES JAMES1 1
TURNER TURNED 1
June 21, 2003 - 10:09 am UTC
do that on a realistic sized set of data ;)
gotta a couple of days? I was recently working with someone using that exact function. The cartesian products blow up fast (100 rows = 100x100 = 10,000 compares. 1,000 rows = 1000x1000=1,000,000 compares and so on).
We estimated that it would take a couple of weeks to finish.
So.... we created the cartesian product in a single table
c1 c2 score
--- ---- -------
... ... ...
leaving score "blank', then using the technique from
</code>
http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:10498431232211 <code>
we wrote a Pro*C program to bulk fetch 100 rows out, compute the distance between the two and update it back in bulk. we got the distances computed using "parallel 32" in about 4-5 hours that way.
Hats off to you Barbara Boehmer !!!!
A reader, June 20, 2003 - 8:54 pm UTC
response to follow-up
Barbara Boehmer, June 21, 2003 - 5:36 pm UTC
Tom,
Thanks for the feedback. You're right that it does get very slow with large data sets. It sounds like you came up with a nifty solution to speed things up.
I have been using the function for data clean-up of non-validated data from other systems. However, I am looping through a cursor and just comparing one value at a time to the appropriate lookup table. I am also able to limit the number of rows that I have to compare to, based on other filter conditions, and am only using the function as a last resort, when there isn't an exact match or soundex match or match based on other criteria. I also do some general standardization, such as converting everything to upper case and removing double spaces and so forth, prior to any comparisons.
I have a weekly process that takes a total of about ten minutes to load four text files into staging tables via SQL*Loader (we are still using Oracle 8.1.7, so I can't use external tables like in 9i yet), identify about three hundred rows from the parent table out of about twenty thousand, and several hundred rows from the three child tables, clean up the data in many of the columns, then insert them into several Oracle tables on our system. So, it has worked well for me. It beats the heck out of cleaning it up manually or eliminating any data that can't be validated otherwise.
By the way, I should be receiving your book in the mail soon. I procrastinated on buying it until it was too late due to Wrox going out of business. However, a friend recently located a used copy, which I purchased, and should be receiving in the mail by Friday. I read Chapter One on-line. Making that first chapter available on-line is an excellent method of enticing people to purchase the book to read the rest, which I am looking forward to shortly.
Keep up the good work!
Thanks,
Barbara
Thanks to Barbara and Tom
umesh, June 23, 2003 - 12:16 am UTC
Comparing two strings
Kit, January 16, 2004 - 11:07 am UTC
HI Tom.
The package posted by the other user was something I needed but as you said , it could take time on large amounts of data.
do you have any other ideas of comparing two columns varchar and counting on the characters that are diff.
How does the soundex internals work.
January 16, 2004 - 11:37 am UTC
well, if you follow the link, i show how i did this exact thing (using pro*c)
soundex "internals" are actually documented. google soundex, the algorithm is out there.
Why is this not true
A reader, April 21, 2005 - 10:33 am UTC
select 1 from dual where soundex('ate')= soundex('eight');
no rows selected
Why is soundex calculating these sounds as different?
Thanks
April 22, 2005 - 8:52 am UTC
ops$tkyte@ORA9IR2> select soundex('ate'), soundex('eight') from dual;
SOUN SOUN
---- ----
A300 E230
Because the standard soundex function, as dictated by the algorith, always uses the first character of the word. If the words do not start with the same character, not going to happen with that standard implementation of soundex.
soundex to oracle text
LC, April 22, 2005 - 10:57 am UTC
Just a quick note on my experiences with soundex and searching...
This worked great for me the with an indexed soundexed_column on the table as described by Tom above. Well, it worked great until my tables grew to so many rows that the returned results were no longer useful. IE... after adding a few hundred thousand PERSON's and doing a soundex search on an obscure name it would return like 1500 rows that were not even close to the searched value. That's just how the soundex algorithm works - potentially lots of false positives on large data sets.
As "enhancing" our search was a federal requirement, I recently moved to Oracle Text using the contains operator with fuzzy matching. I have a slider in my web app that allows the users to control the search precision via fuzzy's parameters. Performance, returned results, and functionality are great.
Now that is still not going to make ate = eight, dick = richard or tom = thomas, but I also created a names thesaurus with common nicknames and alternate spellings for people's first names.
All in all, the oracle text option and already available scoring seem a much better choice for many implementations.
soudnex
A reader, June 11, 2008 - 12:59 pm UTC
Tom:
if soundex search for a magazine title, does it mean we have to create an index on the title column first and have that updated eitehr nightly or using a trigger on idnert/update.
June 11, 2008 - 4:05 pm UTC
if you want to use an index to retrieve data, then you would use a create index - yes.
if you wanted:
select * from t where soundex(title) = soundex(:x);
to use an index, you would create index i on t(soundex(title));
And it would be maintained just like any other index - you need to do nothing special.
soundex
A reader, June 11, 2008 - 7:12 pm UTC
Tom:
Why is the soundex index not getting created
SQL> create index idx_ename on emp(soundex(ename));
create index idx_ename on emp(soundex(ename))
*
ERROR at line 1:
ORA-01031: insufficient privileges
1* create index idx_ename on emp(ename)
SQL> /
Index created.
June 11, 2008 - 9:26 pm UTC
got version?
in old old releases, you needed more privileges to create function based indexes.