Thanks for the question, VSN.

Asked: August 18, 2023 - 7:01 pm UTC

Last updated: September 06, 2023 - 4:24 am UTC

Version: Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production Version 19.18.0.0.0

Viewed 1000+ times

You Asked

Dear Tom,
I have 2 tables and I am finding the similarity between the fields using the oracle package: utl_match.jaro_winkler_similarity
I am facing the performance issue while accessing the query as a view. I need it as a view because I need to find the similarity for the live data by comparing the 2 tables fields. Both tables get inserts and updates frequently.
The table r_data has about 200K and the table m_data has about 400K records.
I need help with sql query.


with r_data as
(
  select cust_id,address addr from
  (
  select 10 cust_id,'9 Help Street, Level 4' address from dual union all
  select 11 cust_id,'22 Victoria Street' address from dual union all
  select 12 cust_id,'1495 Franklin Str' address from dual union all
  select 13 cust_id,'30 Hasivim St.,Petah-Tikva' address from dual union all
  select 14 cust_id,'2 Jakaranda St' address from dual union all
  select 15 cust_id,'61, Science Park Rd' address from dual union all
  select 16 cust_id,'61, Social park road' address from dual union all
  select 17 cust_id,'Av. Hermanos Escobar 5756' address from dual union all
  select 18 cust_id,'Ave. Hermanos Escobar 5756' address from dual union all
  select 19 cust_id,'8000 W FLORISSANT AVE' address from dual union all
  select 20 cust_id,'8600 MEMORIAL PKWY SW' address from dual union all
  select 21 cust_id,'8200 FLORISSANTMEMORIALWAYABOVE SW' address from dual union all
  select 22 cust_id,'8600 MEMORIALFLORISSANT PKWY SW' address from dual
  ) t1
)
SELECT 
m_id,
r_addr,
m_addr,
address_similarity
from
(
SELECT 
m_id,
r_addr,
m_addr,
address_similarity,
row_number() over (partition by m_id order by address_similarity desc) rn
from
(
SELECT 
m_data.id m_id,
r_data.addr r_addr,
m_data.address m_addr,
utl_match.jaro_winkler_similarity(m_data.address,r_data.addr) address_similarity
from
(
 SELECT id,address from
 (
 select 100 id,'9 Help Street, Level 3' address from dual union all
 select 110 id,'22 Victoria Stt' address from dual union all
 select 120 id,'1495 Franklin Street.' address from dual union all
 select 130 id,'30 Hasivim St.,Petah-Tikva' address from dual union all
 select 140 id,'2 Jakaranda St' address from dual union all
 select 150 id,'61, Science Park Rd' address from dual union all
 select 160 id,'61, Social park road' address from dual union all
 select 170 id,'Av. Hermanos Escobar 5756' address from dual union all
 select 180 id,'Ave. Hermanos Escobar 57566' address from dual union all
 select 190 id,'8000 W FLORISSANT Ave.' address from dual union all
 select 200 id,'8600 MEMORIAL PKWY SW' address from dual union all
 select 210 id,'8200 FLORISSANTMEMORIALWAYABOVE SW' address from dual union all
 select 220 id,'8600 MEMORIALFLORISSANT PKWY SW.' address from dual
 ) m
) m_data,r_data
)
) WHERE rn<=1 AND address_similarity>=97;

and Chris said...

The query being in a view is irrelevant - the problem is the joins!

Currently, the query has a cross join between the tables. Even on small tables this is a performance killer. You can see this by the MERGE JOIN CARTESIAN line in the plan:

create table t1 as 
select cust_id,address addr from
  (
  select 10 cust_id,'9 Help Street, Level 4' address from dual union all
  select 11 cust_id,'22 Victoria Street' address from dual union all
  select 12 cust_id,'1495 Franklin Str' address from dual union all
  select 13 cust_id,'30 Hasivim St.,Petah-Tikva' address from dual union all
  select 14 cust_id,'2 Jakaranda St' address from dual union all
  select 15 cust_id,'61, Science Park Rd' address from dual union all
  select 16 cust_id,'61, Social park road' address from dual union all
  select 17 cust_id,'Av. Hermanos Escobar 5756' address from dual union all
  select 18 cust_id,'Ave. Hermanos Escobar 5756' address from dual union all
  select 19 cust_id,'8000 W FLORISSANT AVE' address from dual union all
  select 20 cust_id,'8600 MEMORIAL PKWY SW' address from dual union all
  select 21 cust_id,'8200 FLORISSANTMEMORIALWAYABOVE SW' address from dual union all
  select 22 cust_id,'8600 MEMORIALFLORISSANT PKWY SW' address from dual
  );
  
create table t2 as
SELECT id,address from
 (
 select 100 id,'9 Help Street, Level 3' address from dual union all
 select 110 id,'22 Victoria Stt' address from dual union all
 select 120 id,'1495 Franklin Street.' address from dual union all
 select 130 id,'30 Hasivim St.,Petah-Tikva' address from dual union all
 select 140 id,'2 Jakaranda St' address from dual union all
 select 150 id,'61, Science Park Rd' address from dual union all
 select 160 id,'61, Social park road' address from dual union all
 select 170 id,'Av. Hermanos Escobar 5756' address from dual union all
 select 180 id,'Ave. Hermanos Escobar 57566' address from dual union all
 select 190 id,'8000 W FLORISSANT Ave.' address from dual union all
 select 200 id,'8600 MEMORIAL PKWY SW' address from dual union all
 select 210 id,'8200 FLORISSANTMEMORIALWAYABOVE SW' address from dual union all
 select 220 id,'8600 MEMORIALFLORISSANT PKWY SW.' address from dual
 );

set serveroutput off
 
select
  m_id,
  r_addr,
  m_addr,
  address_similarity
from
  (
    select
      m_id,
      r_addr,
      m_addr,
      address_similarity,
      row_number () over (partition by m_id
        order by address_similarity desc
      ) rn
    from
      (
        select
          m_data.id                                                       m_id,
          r_data.addr                                                     r_addr,
          m_data.address                                                  m_addr,
          utl_match.jaro_winkler_similarity (m_data.address, r_data.addr) address_similarity
        from t2 m_data, t1 r_data
      )
  )
where rn <= 1
and   address_similarity >= 97;

select * 
from   table(dbms_xplan.display_cursor( format => 'BASIC'));

-----------------------------------------
| Id  | Operation                | Name |
-----------------------------------------
|   0 | SELECT STATEMENT         |      |
|   1 |  VIEW                    |      |
|   2 |   WINDOW SORT PUSHED RANK|      |
|   3 |    MERGE JOIN CARTESIAN  |      |
|   4 |     TABLE ACCESS FULL    | T2   |
|   5 |     BUFFER SORT          |      |
|   6 |      TABLE ACCESS FULL   | T1   |
-----------------------------------------

You can address this by joining on the match function:

select * from (
select
  m_data.id                                                       m_id,
  r_data.addr                                                     r_addr,
  m_data.address                                                  m_addr,
  row_number () over (partition by m_data.id
    order by utl_match.jaro_winkler_similarity (m_data.address, r_data.addr) desc
  ) rn
from  t2 m_data
join  t1 r_data
on utl_match.jaro_winkler_similarity (m_data.address, r_data.addr) >= 97
)
where  rn = 1;

select * 
from   table(dbms_xplan.display_cursor( format => 'BASIC'));

-----------------------------------------
| Id  | Operation                | Name |
-----------------------------------------
|   0 | SELECT STATEMENT         |      |
|   1 |  VIEW                    |      |
|   2 |   WINDOW SORT PUSHED RANK|      |
|   3 |    NESTED LOOPS          |      |
|   4 |     TABLE ACCESS FULL    | T2   |
|   5 |     TABLE ACCESS FULL    | T1   |
-----------------------------------------

This should help, but there's a good chance the query will still be slow. If you have a nested loops join as above, the database will full scan the inner table once for each row from the outer table. That's still going to take a while.

As you're accessing columns from both tables in the function it's unlikely you'll be able to create suitable indexes to help. You'll likely need to rethink your approach to get this to perform well.

Rating

(8 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Similarity computation itself is the (or at least "a major") performance killer

mathguy, August 21, 2023 - 6:33 pm UTC

It is not clear what kind of performance you are expecting. Is well over one day per single run of the query considered OK?

Here is a small experiment that shows that just the computation of the Jaro-Winkler similarity score for all your address pairs will cause execution times of that order of magnitude. To be crystal clear: this will have nothing to do with querying a view; it will have nothing to do with join (Cartesian or otherwise), with I/O operations (to read one table repeatedly, even if from memory buffer), or with ordering results to apply ROW_NUMBER and fetch top match for each address. The experiment measures ONLY the time it takes to run the Jaro-Winkler calculation itself.

Your tables have 200k and 400k rows respectively. No matter how you structure the query (join on JW score greater than 97, or Cartesian join with that condition in the WHERE clause, or any other way you can come up with), you WILL have to compute the JW similarity score for 200k * 400k = 80 billion pairs of addresses. In my experiment, shown below, I create 1 million pairs of "addresses". Each "address" is a random string of printable characters of length 20, and the pairs appear as the two columns of a single table, so there is no join of any kind. The first SELECT query shows that the data is read in less than 0.1 seconds on my system; the query to compute the JW similarity score for all pairs takes 1.8 seconds. This is for 1 million pairs; if we extrapolate that linearly to 80 billion pairs, the time required just for the similarity computations is 40 hours.

create table t nologging as
  select  dbms_random.string('p', 20) as addr1,
          dbms_random.string('p', 20) as addr2
  from    dual
  connect by level <= 1e6
;

select min(greatest(addr1, addr2)) as res  --  to show that I/O is not material to the issue
from   t
;

select max(utl_match.jaro_winkler_similarity(addr1, addr2)) as jw_sim  -- this will take time!
from   t
;

So... to improve performance (whatever that means for your real-life case), just modifying the query is not likely to be the answer.

You said there are frequent DML operations on both tables; but do they affect many rows? Or are there, let's say, less than 100 rows changed in each table every day? If so, then one possibility is to create a materialized view, with fast refresh on COMMIT - just to get all the pairs where the JW score is greater than 97. Assuming that the output isn't too many rows (not 80 billion, but say in the hundreds of thousands), an additional view on top of this MV can calculate the top match for each address - that should not be the performance stopper. Computing the MV the first time may still take over a day, but this will only need to be done once.

August 22, 2023 - 9:21 am UTC

Good analysis.

I doubt it'll be possible to build a fast refreshable MV though. The OP will probably have to make a DIY MV and build something to process matches as they're inserted or come up with something completely different.

JW function is stateful??

mathguy, August 23, 2023 - 5:05 am UTC

I doubt it'll be possible to build a fast refreshable MV though. The OP will probably have to make a DIY MV and build something to process matches as they're inserted or come up with something completely different.

I tried to build a fast refreshable MV as I described; it didn't work, you were right. DBMS_MVIEW.EXPLAIN_MVIEW says that the reason is "MV references PL/SQL function that maintains state". I have no idea why the Oracle implementation of JW distance and similarity needs to maintain state - that makes no sense.

If I were the OP, I would probably implement JW directly, and still use it in a MV with fast refresh.

It is also not clear to me why the OP needs JW similarity in his use case (rather than the plain JW function), but that is minor.

August 23, 2023 - 1:10 pm UTC

If the function uses any package variables this counts as "maintaining state", so it's probable the JW does for some reason. Which may or may not be for actually maintaining state across calls.

23c adds matching functions as native SQL operators, which may solve this issue.

This is just the first of many issues with fast refresh for this query though. For example, it uses ROW_NUMBER - window functions are disallowed in fast refreshes. This is because adding/removing a row could require recalculating the whole data set. Which kinda defeats the point of fast refresh!

ROW_NUMBER not in the MV

mathguy, August 23, 2023 - 1:57 pm UTC

@Chris
For example, it uses ROW_NUMBER - window functions are disallowed in fast refreshes.

That is not what I suggested. I said explicitly: The MV computes the JW scores, then filters on something like score >= 97. Assuming the cardinality of the result is reasonable (say, in the hundreds of thousands rather than tens of millions), an additional view on top of the MV can do the ROW_NUMBER business, that should not be the performance killer. That can be a plain view, not a materialized one - or it can be an MV with full refresh on top of the first MV, if that works.

There is no reason for such an MV (to just compute the JW scores and filter on them - same as "joining on the JW score condition") to not be fast refreshable. There is no reason for any of the UTL_MATCH functions to use package variables. They may use package constants, but those should be declared as such (and PL/SQL and the MV advisor should not view package constants as "maintaining state"). Alas, the functions are defined in the kernel, so there's no chance (for me anyway) to see what was actually coded...

August 29, 2023 - 3:49 pm UTC

Sorry I misunderstood - I was referring to the original query still.

Given that 23c adds native operators for the matching functions, the chances of UTL_MATCH changing are extremely low. You can use these for fast refresh (using the tables above):

create materialized view log on t1
  with rowid ( addr )
  including new values;
  
create materialized view log on t2
  with rowid ( address )
  including new values;
 
create materialized view mv as
select
  m_data.rowid mrid,
  r_data.rowid rrid,
  m_data.id                                                       m_id,
  r_data.addr                                                     r_addr,
  m_data.address                                                  m_addr
from  t2 m_data
join  t1 r_data
on fuzzy_match(jaro_winkler, m_data.address, r_data.addr) >= 97;

delete  MV_CAPABILITIES_TABLE;
  
exec dbms_mview.explain_mview('mv');
select capability_name , possible
from   MV_CAPABILITIES_TABLE
where  capability_name like 'REFRESH%';

CAPABILITY_NAME                                                                                                                  P
-------------------------------------------------------------------------------------------------------------------------------- -
REFRESH_COMPLETE                                                                                                                 Y
REFRESH_FAST                                                                                                                     Y
REFRESH_FAST_AFTER_INSERT                                                                                                        Y
REFRESH_FAST_AFTER_ONETAB_DML                                                                                                    Y
REFRESH_FAST_AFTER_ANY_DML                                                                                                       Y
REFRESH_FAST_PCT                                                                                                                 N

Is there a bug in the Oracle implementation of Jaro-Winkler?

mathguy, August 27, 2023 - 5:36 pm UTC

Disclaimer: I am not a paying customer of Oracle. (I don't work in the field, I am a retired mathematician and financial advisor; I use Oracle db for learning, purely for my entertainment, on my home computer.) So I do not have access to things like MOS to check on bugs. Still, I found no trace of what I am reporting here, anywhere I do have access to.

Following up on this thread - I built my own implementation of Jaro-Winkler similarity. I compared to Oracle's computations, and some results agree, others don't. Trying to find my mistakes in order to fix them, I found that, instead, Oracle's implementation seems to be wrong. Asking, first, whether I'm right (that there is a bug in Oracle's implementation); second, is it known (assuming it is indeed a bug), and third, if it is a bug and it is known, whether there are plans to fix it.

In the discussion below I will make no reference to my implementation; I will compare the Oracle function directly to the theoretical definition.

Suppose we have two strings STR1 of length LEN1 > 0 and STR2 of length LEN2 > 0. The Jaro definition computes "matches" M and "transpositions" T and computes a measure of similarity

JS = (M/LEN1 + M/LEN2 + (M - T)/M) / 3

Matches are computed as follows:
Define M_DIST as trunc(greatest(LEN1, LEN2)/2) - 1. Designate all characters in both strings as "unassigned" - this will be altered in the algorithm, as follows:
Loop over the characters of STR1 from left to right. For each such character, inspect all UNASSIGNED characters of STR2 in order from left to right, at positions between P - M_DIST and P + M_DIST (limited by 1 and LEN2 respectively), where P is the position of the character in STR1. As soon as we find an unassigned character in STR2 (within this range of positions) that matches the character from STR1, stop. Designate both the character from STR1 and the character from STR2 as "assigned" and move to the next character in STR1. If no match in STR2 is found, then the character in STR1 remains unassigned, and we move on to the next character in STR1.

This describes the "matching" part of the definition. The number of matches, M, is defined as the number of assigned characters in STR1 (equal to the number of assigned characters in STR2). So far, I believe the Oracle computation is correct.

Then the "transpositions" T:
If M = 0, there is no further calculation; the Jaro similarity is 0
Now assume M > 0. Remove all unassigned characters from both strings; keep the M assigned characters from STR1 in the order they were in the original string, and the same for STR2. Now compare the assigned characters, position by position: first against first, second against second, etc. The number of "transpositions" T is the number of positions where the corresponding characters are different, divided by 2.

Something seems to be wrong in this part of the calculation in the Oracle JARO_WINKLER function in package UTL_MATCH.

Before I show a few examples - the "Winkler" part of Jaro-Winkler has to do with an adjustment for strings that have the same few characters at the left end. In all my examples the very first character is different in all cases (on purpose - so I don't have to consider the Winkler contribution). All the computations are really about the original Jaro similarity. Just keep in mind that in all my examples below, the Winkler term is always 0.

Simplest example:

select utl_match.jaro_winkler('ABCxxx', 'BCAzzz') as js from dual;

                JS
------------------
0.5555555555555555

Here LEN1 = LEN2 = 6, M_DIST = trunc(6/2) - 1 = 2, M = 3, the assigned characters are ABC and BCA respectively, and T = 3/2. (All three pairs of assigned characters, aligned by position, are different.) Plugging into the formula for JS:

JS = (3/6 + 3/6 + (3 - 1.5)/3) / 3 = 0.5

This is not what the Oracle computation shows. Instead, it seems to use a value of 1 for transpositions (T = 1):

 (3/6 + 3/6 + (3 - 1)/3) / 3 = 5/9 = 0.55555555...

There are a few ways this can be explained. Perhaps someone (the programmer, or rather whoever told the programmer what to code) thought that the value T should be the truncated value of "count of misaligned assigned positions, divided by two". That is NOT the correct definition, but perhaps someone misunderstood it. Or, it is possible that the code is written in C, and someone forgot that dividing two integers always results in this kind of truncation; in C, one should divide by 2.0, not by 2. (Common rookie mistake, but one would hope that programmers writing code for Oracle are not rookies.)

But none of these explanations work for the next example:

with
  t (str1, str2) as (
    select 'xxxxxxABC' , 'zzzzzzzzBCA'  from dual union all
    select 'xxxxxxABCD', 'zzzzzzzzBCAD' from dual
  )
select str1, str2, utl_match.jaro_winkler(str1, str2) as js from t;

STR1        STR2                         JS
----------  ------------  -----------------
xxxxxxABC   zzzzzzzzBCA   0.424242424242424
xxxxxxABCD  zzzzzzzzBCAD  0.411111111111111

For the first pair of strings, the correct value (computed directly from definitions) is 0.3686868686 - here the Oracle JS value can be reconciled, again, by using a value T = 1 instead of the correct value T = 1.5 (same as above).

But for the second pair of strings, the correct value is GREATER than what Oracle calculates; it should be 0.452777777. Here the Oracle value can be reconciled by using a value of T = 2 instead of T = 1.5.

So the bug (assuming that's what it is) is more subtle than "integer division in C".

Note in particular, in the last example, that by adding a matching D at the end of both strings, from the first row to the second, the measure of similarity went DOWN. That can't be correct for ANY kind of "fuzzy matching".

Comment about those xx... and zz... I added to all the strings: that is so M_DIST is large enough to allow all the characters I want to be "matching" to be within the matching window in all cases.

September 04, 2023 - 2:38 am UTC

I make no claim to be an expert in the algorithm, but if I fire up the same in python

>>> import jaro
>>> jaro.jaro_winkler_metric(u'ABCxxx', u'BCAzzz')
0.5555555555555555
>>> jaro.jaro_winkler_metric(u'xxxxxxABC', u'zzzzzzzzBCA')
0.42424242424242414

so sorry, but I'm siding with Oracle/python on that one :-)

However, I do see a discrepancy for the last test case

>>> jaro.jaro_winkler_metric(u'xxxxxxABCD', u'zzzzzzzzBCAD')
0.49444444444444446

so I'll chase up that internally.

INDEX TO SPEED UP QUERY IN VIEW

Barbara Boehmer, August 29, 2023 - 8:13 am UTC

You could use a context index with a fuzzy search. The ? mark is the old contains query operator for fuzzy. In the example below, using the previously supplied data, it shows that it can use the domain index to narrow the search results to a smaller data set that the jaro_winkler_similarity can then be applied to. I used 50 for the fuzzy limit, but you can set it higher or lower.

C##SCOTT@XE_21.3.0.0.0> CREATE INDEX t2_id_ctx ON t2 (address)
2 INDEXTYPE IS CTXSYS.CONTEXT
3 /

Index created.

C##SCOTT@XE_21.3.0.0.0> CREATE VIEW test_view
2 AS
3 SELECT m_id, r_addr, m_addr, fuzzy_score, address_similarity
4 FROM (SELECT t2.id AS m_id,
5 t1.addr AS r_addr,
6 t2.address AS m_addr,
7 UTL_MATCH.JARO_WINKLER_SIMILARITY (t2.address, t1.addr)
8 AS address_similarity,
9 SCORE(1) AS fuzzy_score,
10 ROW_NUMBER () OVER
11 (PARTITION BY t2.id, t2.address
12 ORDER BY
13 UTL_MATCH.JARO_WINKLER_SIMILARITY (t2.address, t1.addr) DESC,
14 SCORE(1) DESC)
15 AS rn
16 FROM t1, t2
17 WHERE CONTAINS
18 (t2.address,
19 '?' ||
20 REPLACE (REPLACE (REPLACE (REPLACE (t1.addr, ',', ''), '?', ''), '-', ' '), ' ', ',?'),
21 1) > 50)
22 WHERE rn = 1
23 AND address_similarity > 97
24 /

View created.

C##SCOTT@XE_21.3.0.0.0> SET AUTOTRACE ON EXPLAIN
C##SCOTT@XE_21.3.0.0.0> SELECT * FROM test_view
2 /

M_ID R_ADDR M_ADDR FUZZY_SCORE ADDRESS_SIMILARITY
---------- ---------------------------------- ---------------------------------- ----------- ------------------
100 9 Help Street, Level 4 9 Help Street, Level 3 61 98
130 30 Hasivim St.,Petah-Tikva 30 Hasivim St.,Petah-Tikva 77 100
140 2 Jakaranda St 2 Jakaranda St 69 100
150 61, Science Park Rd 61, Science Park Rd 76 100
160 61, Social park road 61, Social park road 76 100
170 Av. Hermanos Escobar 5756 Av. Hermanos Escobar 5756 76 100
180 Ave. Hermanos Escobar 5756 Ave. Hermanos Escobar 57566 51 99
200 8600 MEMORIAL PKWY SW 8600 MEMORIAL PKWY SW 76 100
210 8200 FLORISSANTMEMORIALWAYABOVE SW 8200 FLORISSANTMEMORIALWAYABOVE SW 69 100
220 8600 MEMORIALFLORISSANT PKWY SW 8600 MEMORIALFLORISSANT PKWY SW. 76 99

10 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 337742161

--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 90 | 59 (2)| 00:00:01 |
|* 1 | VIEW | | 1 | 90 | 59 (2)| 00:00:01 |
|* 2 | WINDOW SORT PUSHED RANK | | 1 | 52 | 59 (2)| 00:00:01 |
| 3 | NESTED LOOPS | | 1 | 52 | 58 (0)| 00:00:01 |
| 4 | TABLE ACCESS FULL | T1 | 13 | 312 | 3 (0)| 00:00:01 |
| 5 | TABLE ACCESS BY INDEX ROWID| T2 | 1 | 28 | 58 (0)| 00:00:01 |
|* 6 | DOMAIN INDEX | T2_ID_CTX | | | 4 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("RN"=1 AND "ADDRESS_SIMILARITY">97)
2 - filter(ROW_NUMBER() OVER ( PARTITION BY "T2"."ID","T2"."ADDRESS" ORDER BY
"UTL_MATCH"."JARO_WINKLER_SIMILARITY"("T2"."ADDRESS","T1"."ADDR") DESC
,"CTXSYS"."SCORE"(1) DESC )<=1)
6 - access("CTXSYS"."CONTAINS"("T2"."ADDRESS",'?'||REPLACE(REPLACE(REPLACE(REPLAC
E("T1"."ADDR",',',''),'?',''),'-',' '),' ',',?'),1)>50)

repost of code with tags

Barbara Boehmer, August 29, 2023 - 8:18 am UTC

I don't see any way to edit my post and I missed the code tags, so I will post the code again here.

C##SCOTT@XE_21.3.0.0.0> CREATE INDEX t2_id_ctx ON t2 (address)
  2  INDEXTYPE IS CTXSYS.CONTEXT
  3  /

Index created.

C##SCOTT@XE_21.3.0.0.0> CREATE VIEW test_view
  2  AS
  3  SELECT m_id, r_addr, m_addr, fuzzy_score, address_similarity
  4  FROM   (SELECT t2.id      AS m_id,
  5        t1.addr    AS r_addr,
  6        t2.address AS m_addr,
  7        UTL_MATCH.JARO_WINKLER_SIMILARITY (t2.address, t1.addr)
  8            AS address_similarity,
  9        SCORE(1)   AS fuzzy_score,
 10        ROW_NUMBER () OVER
 11          (PARTITION BY t2.id, t2.address
 12           ORDER BY
 13           UTL_MATCH.JARO_WINKLER_SIMILARITY (t2.address, t1.addr) DESC,
 14           SCORE(1) DESC)
 15            AS rn
 16        FROM   t1, t2
 17        WHERE  CONTAINS
 18          (t2.address,
 19           '?' ||
 20     REPLACE (REPLACE (REPLACE (REPLACE (t1.addr, ',', ''), '?', ''), '-', ' '), ' ', ',?'),
 21           1) > 50)
 22  WHERE  rn = 1
 23  AND    address_similarity > 97
 24  /

View created.

C##SCOTT@XE_21.3.0.0.0> SET AUTOTRACE ON EXPLAIN
C##SCOTT@XE_21.3.0.0.0> SELECT * FROM test_view
  2  /

      M_ID R_ADDR                             M_ADDR                             FUZZY_SCORE ADDRESS_SIMILARITY                   
---------- ---------------------------------- ---------------------------------- ----------- ------------------                   
       100 9 Help Street, Level 4             9 Help Street, Level 3                      61                 98                   
       130 30 Hasivim St.,Petah-Tikva         30 Hasivim St.,Petah-Tikva                  77                100                   
       140 2 Jakaranda St                     2 Jakaranda St                              69                100                   
       150 61, Science Park Rd                61, Science Park Rd                         76                100                   
       160 61, Social park road               61, Social park road                        76                100                   
       170 Av. Hermanos Escobar 5756          Av. Hermanos Escobar 5756                   76                100                   
       180 Ave. Hermanos Escobar 5756         Ave. Hermanos Escobar 57566                 51                 99                   
       200 8600 MEMORIAL PKWY SW              8600 MEMORIAL PKWY SW                       76                100                   
       210 8200 FLORISSANTMEMORIALWAYABOVE SW 8200 FLORISSANTMEMORIALWAYABOVE SW          69                100                   
       220 8600 MEMORIALFLORISSANT PKWY SW    8600 MEMORIALFLORISSANT PKWY SW.            76                 99                   

10 rows selected.


Execution Plan
----------------------------------------------------------                                                                        
Plan hash value: 337742161                                                                                                        
                                                                                                                                  
--------------------------------------------------------------------------------------------                                      
| Id  | Operation                      | Name      | Rows  | Bytes | Cost (%CPU)| Time     |                                      
--------------------------------------------------------------------------------------------                                      
|   0 | SELECT STATEMENT               |           |     1 |    90 |    59   (2)| 00:00:01 |                                      
|*  1 |  VIEW                          |           |     1 |    90 |    59   (2)| 00:00:01 |                                      
|*  2 |   WINDOW SORT PUSHED RANK      |           |     1 |    52 |    59   (2)| 00:00:01 |                                      
|   3 |    NESTED LOOPS                |           |     1 |    52 |    58   (0)| 00:00:01 |                                      
|   4 |     TABLE ACCESS FULL          | T1        |    13 |   312 |     3   (0)| 00:00:01 |                                      
|   5 |     TABLE ACCESS BY INDEX ROWID| T2        |     1 |    28 |    58   (0)| 00:00:01 |                                      
|*  6 |      DOMAIN INDEX              | T2_ID_CTX |       |       |     4   (0)| 00:00:01 |                                      
--------------------------------------------------------------------------------------------                                      
                                                                                                                                  
Predicate Information (identified by operation id):                                                                               
---------------------------------------------------                                                                               
                                                                                                                                  
   1 - filter("RN"=1 AND "ADDRESS_SIMILARITY">97)                                                                                 
   2 - filter(ROW_NUMBER() OVER ( PARTITION BY "T2"."ID","T2"."ADDRESS" ORDER BY                                                  
              "UTL_MATCH"."JARO_WINKLER_SIMILARITY"("T2"."ADDRESS","T1"."ADDR") DESC                                              
              ,"CTXSYS"."SCORE"(1) DESC )<=1)                                                                                     
   6 - access("CTXSYS"."CONTAINS"("T2"."ADDRESS",'?'||REPLACE(REPLACE(REPLACE(REPLAC                                              
              E("T1"."ADDR",',',''),'?',''),'-',' '),' ',',?'),1)>50)

August 30, 2023 - 1:21 pm UTC

That looks better :)

Good suggestion - worth looking into to see if this helps enough.

Same Jaro-Winkler bug(s) alive and well in Oracle 23

mathguy, September 02, 2023 - 3:41 pm UTC

It took me a few days to install Oracle 23 to test this. I had to upgrade Oracle Linux from 7.9 to 8.8 in the process... and I am not a professional.

The same bug persists in Oracle 23, with the FUZZY_MATCH function. The function has been made "native" but it's very likely the same code as UTL_MATCH was using.

I wonder how much reliance there is in the Oracle universe on the accuracy of the Jaro-Winkler calculation. For example, I hope that the U.S. Census Bureau does not use it - in an Oracle DB environment - for production work. (The function was originally created for census work.)

Since it's no longer about an old package (UTL_MATCH) that may not be maintained anymore, but about a function that has just been made native, I wonder what, if anything, Oracle will do about this. Alas I am not a paying customer, so I can't log bugs.

From SQL*Plus:

select fuzzy_match(jaro_winkler, 'ABCxxx', 'BCAzzz') from dual;

FUZZY_MATCH(JARO_WINKLER,'ABCXXX','BCAZZZ')
-------------------------------------------
        55

I showed earlier that the correct result is 50.

with
  t (str1, str2) as (
    select 'xxxxxxABC' , 'zzzzzzzzBCA'  from dual union all
    select 'xxxxxxABCD', 'zzzzzzzzBCAD' from dual
  )
select str1, str2, fuzzy_match(jaro_winkler, str1, str2) from t;

STR1       STR2         FUZZY_MATCH(JARO_WINKLER,STR1,STR2)
---------- ------------ -----------------------------------
xxxxxxABC  zzzzzzzzBCA        42
xxxxxxABCD zzzzzzzzBCAD       41

The correct results are 37 and 45, respectively; and, there is no definition of "similarity" by which the second pair of strings are LESS similar than the first pair.

Jaro similarity implementation

mathguy, September 04, 2023 - 3:18 am UTC

@Connor - Thank you for taking an interest in this.

I make no claim to be an expert in the algorithm, but if I fire up the same in python [...]

I am definitely not an expert either, but I have done more digging too and I think I am coming closer to the bottom of it.

Matt Jaro has written about his algorithm at least since 1978. I found a manual for his product, called UNIMATCH - in it he describes his algorithm in detail.

The issue is a division by two - there is a count of "transpositions" that is then divided by 2. In his manual, there is no mention of "integer division" or "floating point division"; the way he describes it, it should be float, but I don't know what his product (from 1978) did.

However, I found a C++ implementation - in the public domain - of a program written by folks at the US Census Bureau "using the approach of Jaro and Winkler", from around 1995, called STRCMP95. In it, they do clearly use integer division - which is, I assume, what Oracle, python and others followed.

In his writings, Winkler explicitly writes 0.5 * TRANSPOSITIONS in some places - so he clearly understood Jaro's algo to use floating point division by 2, not integer division.

Anyway, this is not (just) an Oracle thing, obviously. And it's not a fight anyone is going to win (on the side of floating point), nor one I have a particular interest in. Rightly or wrongly, it seems the Census Bureau uses integer division. For what it's worth, they had another mistake in the first version of STRCMP95 - also because they didn't understand Jaro's algo - but that was caught and fixed (and specifically acknowledged) in the second version.

The discrepancy you found with python is quite different. I know what the Oracle function does there (I explained it in this thread) but I have no idea why.

September 06, 2023 - 4:24 am UTC

If I get anything back from the devs, I'll post an update here

More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.

Breadcrumb

Question and Answer

You Asked

and Chris said...

Rating

Comments

Similarity computation itself is the (or at least "a major") performance killer

JW function is stateful??

ROW_NUMBER not in the MV

Is there a bug in the Oracle implementation of Jaro-Winkler?

INDEX TO SPEED UP QUERY IN VIEW

repost of code with tags

Same Jaro-Winkler bug(s) alive and well in Oracle 23

Jaro similarity implementation

More to Explore

Analytics

Performace issue with sql in the view using utl_match.jaro_winkler_similarity

Breadcrumb

Question and Answer

You Asked

and Chris said...

Rating

Comments

Similarity computation itself is the (or at least "a major") performance killer

JW function is stateful??

ROW_NUMBER not in the MV

Is there a bug in the Oracle implementation of Jaro-Winkler?

INDEX TO SPEED UP QUERY IN VIEW

repost of code with tags

Same Jaro-Winkler bug(s) alive and well in Oracle 23

Jaro similarity implementation

More to Explore

Analytics