Disclaimer: I am not a paying customer of Oracle. (I don't work in the field, I am a retired mathematician and financial advisor; I use Oracle db for learning, purely for my entertainment, on my home computer.) So I do not have access to things like MOS to check on bugs. Still, I found no trace of what I am reporting here, anywhere I do have access to.
Following up on this thread - I built my own implementation of Jaro-Winkler similarity. I compared to Oracle's computations, and some results agree, others don't. Trying to find my mistakes in order to fix them, I found that, instead, Oracle's implementation seems to be wrong. Asking, first, whether I'm right (that there is a bug in Oracle's implementation); second, is it known (assuming it is indeed a bug), and third, if it is a bug and it is known, whether there are plans to fix it.
In the discussion below I will make no reference to my implementation; I will compare the Oracle function directly to the theoretical definition.
Suppose we have two strings STR1 of length LEN1 > 0 and STR2 of length LEN2 > 0. The Jaro definition computes "matches" M and "transpositions" T and computes a measure of similarity
JS = (M/LEN1 + M/LEN2 + (M - T)/M) / 3
Matches are computed as follows:
Define M_DIST as trunc(greatest(LEN1, LEN2)/2) - 1. Designate all characters in both strings as "unassigned" - this will be altered in the algorithm, as follows:
Loop over the characters of STR1 from left to right. For each such character, inspect all UNASSIGNED characters of STR2 in order from left to right, at positions between P - M_DIST and P + M_DIST (limited by 1 and LEN2 respectively), where P is the position of the character in STR1. As soon as we find an unassigned character in STR2 (within this range of positions) that matches the character from STR1, stop. Designate both the character from STR1 and the character from STR2 as "assigned" and move to the next character in STR1. If no match in STR2 is found, then the character in STR1 remains unassigned, and we move on to the next character in STR1.
This describes the "matching" part of the definition. The number of matches, M, is defined as the number of assigned characters in STR1 (equal to the number of assigned characters in STR2). So far, I believe the Oracle computation is correct.
Then the "transpositions" T:
If M = 0, there is no further calculation; the Jaro similarity is 0
Now assume M > 0. Remove all unassigned characters from both strings; keep the M assigned characters from STR1 in the order they were in the original string, and the same for STR2. Now compare the assigned characters, position by position: first against first, second against second, etc. The number of "transpositions" T is the number of positions where the corresponding characters are different, divided by 2.
Something seems to be wrong in this part of the calculation in the Oracle JARO_WINKLER function in package UTL_MATCH.
Before I show a few examples - the "Winkler" part of Jaro-Winkler has to do with an adjustment for strings that have the same few characters at the left end. In all my examples the very first character is different in all cases (on purpose - so I don't have to consider the Winkler contribution). All the computations are really about the original Jaro similarity. Just keep in mind that in all my examples below, the Winkler term is always 0.
Simplest example:
select utl_match.jaro_winkler('ABCxxx', 'BCAzzz') as js from dual;
JS
------------------
0.5555555555555555
Here LEN1 = LEN2 = 6, M_DIST = trunc(6/2) - 1 = 2, M = 3, the assigned characters are ABC and BCA respectively, and T = 3/2. (All three pairs of assigned characters, aligned by position, are different.) Plugging into the formula for JS:
JS = (3/6 + 3/6 + (3 - 1.5)/3) / 3 = 0.5
This is not what the Oracle computation shows. Instead, it seems to use a value of 1 for transpositions (T = 1):
(3/6 + 3/6 + (3 - 1)/3) / 3 = 5/9 = 0.55555555...
There are a few ways this can be explained. Perhaps someone (the programmer, or rather whoever told the programmer what to code) thought that the value T should be the truncated value of "count of misaligned assigned positions, divided by two". That is NOT the correct definition, but perhaps someone misunderstood it. Or, it is possible that the code is written in C, and someone forgot that dividing two integers always results in this kind of truncation; in C, one should divide by 2.0, not by 2. (Common rookie mistake, but one would hope that programmers writing code for Oracle are not rookies.)
But none of these explanations work for the next example:
with
t (str1, str2) as (
select 'xxxxxxABC' , 'zzzzzzzzBCA' from dual union all
select 'xxxxxxABCD', 'zzzzzzzzBCAD' from dual
)
select str1, str2, utl_match.jaro_winkler(str1, str2) as js from t;
STR1 STR2 JS
---------- ------------ -----------------
xxxxxxABC zzzzzzzzBCA 0.424242424242424
xxxxxxABCD zzzzzzzzBCAD 0.411111111111111
For the first pair of strings, the correct value (computed directly from definitions) is 0.3686868686 - here the Oracle JS value can be reconciled, again, by using a value T = 1 instead of the correct value T = 1.5 (same as above).
But for the second pair of strings, the correct value is GREATER than what Oracle calculates; it should be 0.452777777. Here the Oracle value can be reconciled by using a value of T = 2 instead of T = 1.5.
So the bug (assuming that's what it is) is more subtle than "integer division in C".
Note in particular, in the last example, that by adding a matching D at the end of both strings, from the first row to the second, the measure of similarity went DOWN. That can't be correct for ANY kind of "fuzzy matching".
Comment about those xx... and zz... I added to all the strings: that is so M_DIST is large enough to allow all the characters I want to be "matching" to be within the matching window in all cases.