Skip to Main Content
  • Questions
  • Longest common substring - Ratcliff-Obershelp similarity algorithm

Breadcrumb

Question and Answer

Chris Saxon

Thanks for the question.

Asked: August 18, 2020 - 11:00 am UTC

Answered by: Chris Saxon - Last updated: August 24, 2020 - 5:28 pm UTC

Category: PL/SQL - Version: 11.2.0.4

Viewed 1000+ times

You Asked

Hi TOM,

Is there a way to make my code faster ?
I'm trying to find the longest common sub-string (short: lcs) using dynamic programming; in this respect, i'm building a 2D array with x rows and Y columns, where X - length of string1, Y = length of string2.
First, i initialize the array with 0; (can this be avoided ?)
Then, i start from top-left, one character (from string 1) at a time and each time there's a match with a character from string2, i put +1 in that cell
Here's a visual description:

https://imgur.com/a/gz8JTms

I'm already using PL/SQL native compilation.

Following this lcs function, i'm using Ratcliff-Obershelp similarity algorithm and i'm relly stucked at how to build a function iteratively that would use this "lcs" function; any help would be greatly appreciated.

Thanks!

with LiveSQL Test Case:

and we said...

I'm not familiar with the Ratcliff-Obershelp similarity algorithm, but here are some thoughts:

First, i initialize the array with 0; (can this be avoided ?)

Yes. You can cut the loops to initialize the matrix at the start.

Within the loop to test the length, if neither index variable is 1 (the else branch of i = 1 OR j = 1), get the previous index value if it exists. Otherwise set the current position to one.

Which looks something like this:

if p_matrix.exists ( i-1 ) and p_matrix ( i-1 ).exists ( j-1 ) then
  p_matrix ( i )( j ) := p_matrix ( i-1 )( j-1 ) + 1; 
else 
  p_matrix ( i )( j ) := 1;
end if;


You should be able to build in logic to break out of the loops earlier too. If the length of current position + length of remaining string <= max length found so far, the remaining characters can't make a string any longer than the one you've already found. So you can exit the loop at this point.

Exactly how much this saves you depends on the input strings. Worst case there's no matching characters, so you're still looping through every element of both for O ( n*m ) runtime.

You could short-circuit by checking there are at least some matching characters before the loops.

One way to do this is with translate. Pass the input strings as the first two arguments and an out-of-bounds character as the third (e.g. #). Replace this out-of-bounds character in the result.

select 
  translate ( 'WIKIMEDIA', 'REMEDYA', '#' ) tr1,
  replace ( translate ('WIKIMEDIA', 'REMEDYA', '#'), '#' ) reptr1,
  translate ( 'WIKIMEDIA', 'NOSUB', '#' ) tr2,
  replace ( translate ('WIKIMEDIA', 'NOSUB', '#' ), '#' ) reptr2
from dual;

TR1     REPTR1   TR2         REPTR2      
WIKII   WIKII    WIKIMEDIA   WIKIMEDIA   


If the length of the final string equals the length of the first input, you know there are no matching characters. So can bypass the loops altogether.

I'm sure there are other optimizations; others may have ideas that can help.

and you rated our response

  (17 ratings)

Reviews

Ratcliff-Obershelp

August 21, 2020 - 9:39 am UTC

Reviewer: Alex

Thanks Chris for the ideas to improve the lcs (longest common sub-string) algorithm, which is a per-requisite for the mentioned algorithm
Any thoughts on how to apply the function recursively ?
i.e. start from the input strings :
lcs('WIKIMEDIA', 'REMEDYA') = 'MED'

next we need to check the remaining chunks on either sides:
lcs('WIKI', 'RE') = ''

and
lcs ('IA', 'YA') = 'A'

which finally leads to:
lcs('I', 'Y') = ''

So a loop that exists whenever lcs() = '', but the tricky part is that the strings split in smaller strings, much like a nuclear fission reaction.
Thanks,

Chris Saxon

Followup  

August 21, 2020 - 10:59 am UTC

I'm not sure exactly what you're getting at - why do you want to do this recursively? Please clarify!

Setting myself a little challenge ;-)

August 21, 2020 - 9:57 am UTC

Reviewer: Kim Berg Hansen from Middelfart, Denmark

I won't have time until after the weekend, but I'm going to attempt to do this in pure SQL in the MODEL clause ;-)
Chris Saxon

Followup  

August 21, 2020 - 11:00 am UTC

Go Kim! I had a quick try with match_recognize, maybe you'll have more luck :)

Follow-up

August 21, 2020 - 1:00 pm UTC

Reviewer: Alex

@Chris: Yes, i need to do that (apply lcs) recursively, as this is what Ratcliff-Obershelp (aka Gestalt pattern matching) algorithm is all about.
Count how many chars in lcsuntil lcs returns for all strings.
Then the string similarity score of the 2 initial strings is :
2 * (lcs_1 + lcs_2 + .. + lcs_n)/ (sum of characters in the 2 strings combined) 

More info here:
https://en.wikipedia.org/wiki/Gestalt_Pattern_Matching
Chris Saxon

Followup  

August 21, 2020 - 2:08 pm UTC

I see...

Once you've found the LCS, you can get the precediing/following strings with a bit of SUBSTR/INSTR magic:

with strs as (
  select 'WIKIMEDIA' s from dual
)
  select substr ( s, 1, instr ( s, 'MED' ) - 1 ) st, 
         substr ( s, instr ( s, 'MED' ) + length ( 'MED' ) ) en
  from   strs;
  
ST     EN   
WIKI   IA  


And use these to call the LCS algorithm again on each starting/ending pair.

Using recursive subquery factoring for the LCS

August 21, 2020 - 4:15 pm UTC

Reviewer: Stew Ashton from France

This should be faster when there is a long common substring and slower when there is no match at all. It is better when "B" is the shorter string. "STRT" and "LEN" give you a head start for breaking off the unmatched bits from "B".

with input(a, b) as (
  select 'antidisestablishmentarianism', 'disentanglement' from dual
)
select strt, len, substr(b,strt,len) lcs
from input i,
lateral(select length(i.b) + 1 - level len from dual connect by length(i.b) + 1 - level > 0) s,
lateral(select level strt from dual connect by level + s.len - 1 <= length(i.b))
where instr(a, substr(b,strt,len)) > 0
and rownum = 1;

LCS
----
dise

Slight improvement

August 21, 2020 - 7:17 pm UTC

Reviewer: Stew Ashton from France

Actually, this solution calculates for both strings the starting position of the LCS, so I should have provided both. This should make it easier to extract the extra bits before and after the LCS.
with input(a, b) as (
  select 'antidisestablishmentarianism', 'disentanglement' from dual
)
select instr(a, substr(b,b_strt,len)) a_strt, b_strt, len, substr(b, b_strt, len) lcs
from input i,
lateral(select length(i.b) + 1 - level len from dual connect by length(i.b) + 1 - level > 0) s,
lateral(select level b_strt from dual connect by level + s.len - 1 <= length(i.b))
where instr(a, substr(b,b_strt,len)) > 0
and rownum = 1;

A_STRT  B_STRT  LEN  LCS
------  ------  ---  ----
5       1       4    dise

Follow-up

August 22, 2020 - 6:22 am UTC

Reviewer: Alex

@Stew: nice use of lateral!
@Chris:
Coming back to your suggestion,
with strs as (
  select 'WIKIMEDIA' s from dual
UNION
  select 'REMEDIYA' from dual
)
  select substr ( s, 1, instr ( s, 'MED' ) - 1 ) left_Str, 
         substr ( s, instr ( s, 'MED' ) + length ( 'MED' ) ) right_Str
  from   strs;

which outputs:

LEFT_STR RIGHT_STR
RE IYA
WIKI IA


Now to the part of making the recursive loop: i was thinking of calling
lcs((1, 1), (2,1))

and
lcs((1,2), (2,2))

You guys have a better idea ?

follow-up

August 22, 2020 - 9:22 am UTC

Reviewer: Alex

Something like this:
TYPE left_string is TABLE OF VARCHAR2 INDEX BY PLS_INTEGER;
TYPE right_string is TABLE OF VARCHAR2 INDEX BY PLS_INTEGER;
left_string(1) := 'WIKIMEDIA'; --string length: 9
right_string(1) := 'REMEDIYA'; --string length: 8
lcs(left_string(1), right_string(1)) := 'MEDI';

left_string(2) := 'WIKI';
right_string(2) := 'RE';

left_string(3) := 'IA';
right_string(3) := 'YA';

lcs(left_string(2), right_string(2)) := ''; --dead end
lcs(left_string(3), right_string(3)) := 'A';

left_string(4) := 'I';
left_string(4) := 'Y';

left_string(5) := ''; --right side of 'A' from 'IA', dead end, stop computing right_string(5).

lcs(left_string(5), right_string(5)) := ''; --last string chunks to compare, exit loop

--We now compute the length of all lcs: 4 + 1 = 5 and considering the length of the initial strings 17 (9 + 8),
--this makes the Ratcliff-Obershelp similarity score:
2 * 5 / 17 = 0.588


The challenge is to make this recursion in (pl)sql..

Attempt at full solution

August 22, 2020 - 12:31 pm UTC

Reviewer: Stew Ashton from France

create or replace function gestalt(
  l in varchar2,
  r in varchar2
) return number is
  function lcs (
    l in varchar2,
    r in varchar2
  ) return number is
    ll varchar2(4000);
    rl varchar2(4000);
    lr varchar2(4000);
    rr varchar2(4000);
  begin
    for rec in (
      select instr(l, substr(r,r_strt,len)) l_strt, r_strt, len
      from dual,
      lateral(select length(r) + 1 - level len from dual connect by length(r) + 1 - level > 0) s,
      lateral(select level r_strt from dual connect by level + s.len - 1 <= length(r)) e
      where instr(l, substr(r,e.r_strt,len)) > 0
      and rownum <= 1
    ) loop
      ll := substr(l, 1, rec.l_strt - 1);
      rl := substr(r, 1, rec.r_strt - 1);
      lr := substr(l, rec.l_strt + rec.len);
      rr := substr(r, rec.r_strt + rec.len);
      return rec.len + lcs(ll, rl) + lcs(lr, rr);
    end loop;
    return 0;
  end lcs;
begin
  return 2 * lcs(l, r) / ( length(l) + length(r) );
end gestalt;
/
select round(gestalt('WIKIMEDIA', 'REMEDIYA'),3) from dual;

0.588

old fashion way using recursive sql

August 22, 2020 - 3:46 pm UTC

Reviewer: Koen Lostrie from Belgium

WITH strings (string1, string2) as
(select 'antidisestablishmentarianism', 'disentanglement' from dual)
,string1_data(stringcount, stringlen, start_pos_counter, start_pos ) AS 
(
      SELECT (LENGTH(string1)/2)*(LENGTH(string1)+1) r, LENGTH(string1), LENGTH(string1), 1  FROM strings t
      UNION ALL
     SELECT stringcount-1
          , stringlen
          , case when start_pos >= least(stringlen,start_pos_counter) then start_pos_counter - 1 else start_pos_counter end
          , case when start_pos >= least(stringlen,start_pos_counter) then 1 else start_pos + 1 end 
       FROM string1_data WHERE stringcount > 1
),string1_occurences (substring) as 
(
  SELECT substr(string1,start_pos,stringlen - start_pos_counter + 1)  FROM string1_data cross join strings
),string2_data(stringcount, stringlen, start_pos_counter, start_pos ) AS 
(
      SELECT (LENGTH(string2)/2)*(LENGTH(string2)+1) r, LENGTH(string2), LENGTH(string2), 1  FROM strings t
      UNION ALL
     SELECT stringcount-1
          , stringlen
          , case when start_pos >= least(stringlen,start_pos_counter) then start_pos_counter - 1 else start_pos_counter end
          , case when start_pos >= least(stringlen,start_pos_counter) then 1 else start_pos + 1 end 
       FROM string2_data WHERE stringcount > 1
),string2_occurences (substring) as 
(
  SELECT substr(string2,start_pos,stringlen - start_pos_counter + 1)  FROM string2_data cross join strings
)
select str1.substring 
  from string1_occurences str1
  join string2_occurences str2 on str1.substring = str2.substring order by length(str1.substring) desc fetch first 1 rows only; 

Chris Saxon

Followup  

August 24, 2020 - 2:13 pm UTC

Good stuff Koen :)

Follow-up: many thanks

August 22, 2020 - 4:14 pm UTC

Reviewer: Alex

@Stew: you're amazing!
I was struggling with building a strings array with all lcs (iteratively, divide&conquer style):
https://livesql.oracle.com/apex/livesql/s/kj0kjr5p5o47fh3pohybch14w
But your gestalt function made it look trivial!

Thank you very much, fantastic community!

Follow-up - performance

August 22, 2020 - 6:03 pm UTC

Reviewer: Alex

Can i have one last wish ?:)
Using Stew's SQL approach/function i get to process 70K strings in ~20 sec on my box.
I am assuming this is due to PL/SQL - SQL context switch.
Ideally, the results should come in 1-3 seconds for 100K strings.
I'm thinking of a PL/SQL approach for the (recursive) lcs function could perform better.
Any thoughts ?

Progress reporting

August 23, 2020 - 4:43 am UTC

Reviewer: Alex

I'm comparing the PL/SQL and SQL version to compute Ratcliff-Obershelp string similarity.

Scenario:

min_threshold: 0.65 (empiric value), testing string similarity for 'PRIVIL' against 105314 rows (all_objects * 2).

(1) PL/SQL version
https://livesql.oracle.com/apex/livesql/s/kj3h8q55suglbbyy0edrnmieq
Result:
2 strings selected out of pool of [105314] in 13.42 seconds.

(2) SQL version
https://livesql.oracle.com/apex/livesql/s/kj3pcfcl4hlqh6db9cfr45mjf
Result:
ORA-00040: active time limit exceeded

Can i make the PL/SQL version drop to max 3 seconds for same workload (100K records) ? Or the SQL version (that is currently dropped by the resource manager before finishing) ?

Another recursive variant

August 23, 2020 - 10:53 am UTC

Reviewer: Iudith Mentzel from Israel

Borrowing the so elegant first step of Stew Ashton, here is another "divide and conquer" solution,
just for your fun.

WITH 
FUNCTION f_lcs (p_str1 IN VARCHAR2, p_str2 IN VARCHAR2) RETURN VARCHAR2
AS
    l_lcs  VARCHAR2(2000);
BEGIN
--
if p_str1 is null or p_str2 is null then
   return null;
end if;
--
select substr(i.b, strt.b_strt, slen.len) lcs
into l_lcs
from 
    ( select p_str1 a, p_str2 b from dual ) i,
     lateral(select length(i.b) + 1 - level len 
             from dual 
             connect by length(i.b) + 1 - level > 0
     ) slen,
     lateral(select level b_strt 
             from dual 
             connect by level + slen.len - 1 <= length(i.b)
     ) strt
where instr(i.a, substr(i.b,strt.b_strt,slen.len)) > 0
order by slen.len desc
fetch first 1 row only;
--
return l_lcs;
EXCEPTION
     WHEN NO_DATA_FOUND THEN
          RETURN NULL;
END;
--
input(str1, str2) as (
  select 'antidisestablishmentarianism', 'disentanglement' from dual
),
--
recurs(str1, str2, lcs, lvl) as (
  select str1, str2, f_lcs(str1, str2), 1
  from input
  union all
  select case d.rn
            when 1 then substr(r.str1, 1, instr(r.str1, r.lcs) - 1)
                   else substr(r.str1, instr(r.str1, r.lcs) + length(r.lcs))  
         end str1
       , case d.rn
            when 1 then substr(r.str2, 1, instr(r.str2, r.lcs) - 1)
                   else substr(r.str2, instr(r.str2, r.lcs) + length(r.lcs))  
         end str2
       , f_lcs(
         case d.rn
            when 1 then substr(r.str1, 1, instr(r.str1, r.lcs) - 1)
                   else substr(r.str1, instr(r.str1, r.lcs) + length(r.lcs))  
         end,
         case d.rn
            when 1 then substr(r.str2, 1, instr(r.str2, r.lcs) - 1)
                   else substr(r.str2, instr(r.str2, r.lcs) + length(r.lcs))  
         end
         )  lcs
       , r.lvl + 1
  from 
       recurs r,
     ( select rownum rn from dual connect by level <= 2 )  d
  where
       r.lcs is not null
)
--
select * 
from recurs
/

STR1    STR2  LCS LVL
----------------------------------------------------------
antidisestablishmentarianism disentanglement dise 1
anti     -    -  2
stablishmentarianism  ntanglement ment 2
stablish   ntangle  ta 3
arianism    -    -  3
s    n   -  4
blish    ngle  l 4
b    ng   -  5
ish    e   -  5

9 rows selected.



Still using a single SELECT, you can extend it to process several pairs by replacing the INPUT query.

Here the output shows the recursion step by step, for the final result you should aggregate to get a single row
for each input entry.
Cheers :)

Chris Saxon

Followup  

August 24, 2020 - 2:12 pm UTC

Nice stuff Iudith

Solution with fewer context switches

August 23, 2020 - 2:56 pm UTC

Reviewer: Stew Ashton from France

SQL> create table t as select object_name from all_objects;

Table T created.

SQL> ALTER SESSION SET PLSQL_CODE_TYPE='NATIVE';

Session altered.

SQL> create or replace function gestalt2(
  2    l in varchar2,
  3    r in varchar2
  4  ) return number is
  5    function lcs (
  6      l in varchar2,
  7      r in varchar2
  8    ) return number is
  9      l_start pls_integer := 0;
 10      r_start pls_integer := 0;
 11      lr_length pls_integer := least(length(l),length(r));
 12      ll varchar2(128);
 13      rl varchar2(128);
 14      lr varchar2(128);
 15      rr varchar2(128);
 16    begin
 17      if l is null or r is null then
 18        return 0;
 19      end if;
 20      <<len_loop>>
 21      loop
 22        l_start := 1;
 23        loop
 24          r_start := instr(r,substr(l,l_start,lr_length));
 25          exit len_loop when r_start > 0;
 26          exit when l_start = length(l) - lr_length + 1;
 27          l_start :=  l_start + 1;
 28        end loop;
 29        exit when lr_length = 1;
 30        lr_length := lr_length - 1;
 31      end loop;
 32      if r_start = 0 then
 33        return 0;
 34      else
 35        ll := substr(l, 1, l_start - 1);
 36        rl := substr(r, 1, r_start - 1);
 37        lr := substr(l, l_start + lr_length);
 38        rr := substr(r, r_start + lr_length);
 39        return lr_length + lcs(ll, rl) + lcs(lr, rr);
 40      end if;
 41    end lcs;
 42  begin
 43    return 2 * lcs(l, r) / ( length(l) + length(r) );
 44  end gestalt2;
 45  /

Function GESTALT2 compiled

SQL> set timing on
SQL> declare
  2    l_gestalt number;
  3  begin
  4    for rec in (select * from t) loop
  5      l_gestalt := gestalt2('PRIVIL', rec.object_name);
  6    end loop;
  7  end;
  8  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.709
Note: T has 80306 rows
Chris Saxon

Followup  

August 24, 2020 - 2:12 pm UTC

Great work Stew!

Follow-up

August 23, 2020 - 5:23 pm UTC

Reviewer: Alex

@Stew: Chapeau!

Because I promised a MODEL solution

August 24, 2020 - 2:56 pm UTC

Reviewer: Kim Berg Hansen from Middelfart, Denmark

It's ugly and slow, but here it is anyway ;-)

For one word-pair:

select
   w1, w2, 2 * lmx / (length(w1) + length(w2)) ro_score
from (
   select *
   from (
      select 'WIKIMEDIA' as w1, 'REMEDIYA' as w2 from dual
   )
   model
      partition by (w1, w2)
      dimension by (0 as i, 0 as p1, 0 as p2)
      measures (upper(w1) as m1, upper(w2) as m2, cast(null as varchar2(1)) as ch, 0 as mx, 0 as lmx, 0 as l1, 0 as l2)
      ignore nav
      rules sequential order iterate(10) until iteration_number >= l2[-1, 0, 0] (
         ch[iteration_number, for p1 from 1 to length(m1[iteration_number,0,0]) increment 1, 0] = substr(m1[cv(i),0,0], cv(p1), 1)
       , ch[iteration_number, 0, for p2 from 1 to length(m2[iteration_number,0,0]) increment 1] = substr(m2[cv(i),0,0], cv(p2), 1)
       , mx[iteration_number, for p1 from 1 to length(m1[iteration_number,0,0]) increment 1, for p2 from 1 to length(m2[iteration_number,0,0]) increment 1]
            = case ch[cv(i), cv(p1), 0] when ch[cv(i), 0, cv(p2)] then mx[cv(i), cv(p1)-1, cv(p2)-1] + 1 end
       , lmx[iteration_number, 0, 0] = max(mx)[cv(i), any, any]
       , l1[iteration_number, any, any] = case mx[cv(), cv(), cv()] when lmx[cv(i), 0 ,0] then cv(p1) end
       , l2[iteration_number, any, any] = case mx[cv(), cv(), cv()] when lmx[cv(i), 0 ,0] then cv(p2) end
       , l1[iteration_number, 0, 0] = min(l1)[cv(i), any, any]
       , l2[iteration_number, 0, 0] = min(l2)[cv(i), any, any]
       , lmx[-1, 0, 0] = lmx[-1, 0, 0] + lmx[iteration_number, 0, 0]
       , m1[l2[-1, 0, 0]+1, 0, 0] = case when lmx[iteration_number, 0, 0] > 0 then substr(m1[iteration_number, 0, 0], 1, l1[iteration_number, 0, 0] - lmx[iteration_number, 0, 0]) end
       , m2[l2[-1, 0, 0]+1, 0, 0] = case when lmx[iteration_number, 0, 0] > 0 then substr(m2[iteration_number, 0, 0], 1, l2[iteration_number, 0, 0] - lmx[iteration_number, 0, 0]) end
       , l2[-1, 0, 0] = l2[-1, 0, 0] + case when m1[l2[-1, 0, 0]+1, 0, 0] is not null and m2[l2[-1, 0, 0]+1, 0, 0] is not null then 1 else 0 end
       , m1[l2[-1, 0, 0]+1, 0, 0] = case when lmx[iteration_number, 0, 0] > 0 then substr(m1[iteration_number, 0, 0], l1[iteration_number, 0, 0] + 1) end
       , m2[l2[-1, 0, 0]+1, 0, 0] = case when lmx[iteration_number, 0, 0] > 0 then substr(m2[iteration_number, 0, 0], l2[iteration_number, 0, 0] + 1) end
       , l2[-1, 0, 0] = l2[-1, 0, 0] + case when m1[l2[-1, 0, 0]+1, 0, 0] is not null and m2[l2[-1, 0, 0]+1, 0, 0] is not null then 1 else 0 end
      )
)
where (i, p1, p2) = ((-1,0,0))
order by ro_score desc nulls last;


For about a thousand:

select
   w1, w2, 2 * lmx / (length(w1) + length(w2)) ro_score
from (
   select *
   from (
      select distinct 'PRIVIL' as w1, object_name as w2 from all_objects
      where rownum <= 1000
   )
   model
      partition by (w1, w2)
      dimension by (0 as i, 0 as p1, 0 as p2)
      measures (upper(w1) as m1, upper(w2) as m2, cast(null as varchar2(1)) as ch, 0 as mx, 0 as lmx, 0 as l1, 0 as l2)
      ignore nav
      rules sequential order iterate(10) until iteration_number >= l2[-1, 0, 0] (
         ch[iteration_number, for p1 from 1 to length(m1[iteration_number,0,0]) increment 1, 0] = substr(m1[cv(i),0,0], cv(p1), 1)
       , ch[iteration_number, 0, for p2 from 1 to length(m2[iteration_number,0,0]) increment 1] = substr(m2[cv(i),0,0], cv(p2), 1)
       , mx[iteration_number, for p1 from 1 to length(m1[iteration_number,0,0]) increment 1, for p2 from 1 to length(m2[iteration_number,0,0]) increment 1]
            = case ch[cv(i), cv(p1), 0] when ch[cv(i), 0, cv(p2)] then mx[cv(i), cv(p1)-1, cv(p2)-1] + 1 end
       , lmx[iteration_number, 0, 0] = max(mx)[cv(i), any, any]
       , l1[iteration_number, any, any] = case mx[cv(), cv(), cv()] when lmx[cv(i), 0 ,0] then cv(p1) end
       , l2[iteration_number, any, any] = case mx[cv(), cv(), cv()] when lmx[cv(i), 0 ,0] then cv(p2) end
       , l1[iteration_number, 0, 0] = min(l1)[cv(i), any, any]
       , l2[iteration_number, 0, 0] = min(l2)[cv(i), any, any]
       , lmx[-1, 0, 0] = lmx[-1, 0, 0] + lmx[iteration_number, 0, 0]
       , m1[l2[-1, 0, 0]+1, 0, 0] = case when lmx[iteration_number, 0, 0] > 0 then substr(m1[iteration_number, 0, 0], 1, l1[iteration_number, 0, 0] - lmx[iteration_number, 0, 0]) end
       , m2[l2[-1, 0, 0]+1, 0, 0] = case when lmx[iteration_number, 0, 0] > 0 then substr(m2[iteration_number, 0, 0], 1, l2[iteration_number, 0, 0] - lmx[iteration_number, 0, 0]) end
       , l2[-1, 0, 0] = l2[-1, 0, 0] + case when m1[l2[-1, 0, 0]+1, 0, 0] is not null and m2[l2[-1, 0, 0]+1, 0, 0] is not null then 1 else 0 end
       , m1[l2[-1, 0, 0]+1, 0, 0] = case when lmx[iteration_number, 0, 0] > 0 then substr(m1[iteration_number, 0, 0], l1[iteration_number, 0, 0] + 1) end
       , m2[l2[-1, 0, 0]+1, 0, 0] = case when lmx[iteration_number, 0, 0] > 0 then substr(m2[iteration_number, 0, 0], l2[iteration_number, 0, 0] + 1) end
       , l2[-1, 0, 0] = l2[-1, 0, 0] + case when m1[l2[-1, 0, 0]+1, 0, 0] is not null and m2[l2[-1, 0, 0]+1, 0, 0] is not null then 1 else 0 end
      )
)
where (i, p1, p2) = ((-1,0,0))
order by ro_score desc nulls last;


But takes a couple seconds for 988 word pairs for me, so definitely not better than Stew's solution.
And this is also a lot uglier and hard to understand - but was fun to make ;-)
Chris Saxon

Followup  

August 24, 2020 - 5:28 pm UTC

Glad you had fun :)

Apologies

October 22, 2020 - 4:16 pm UTC

Reviewer: Stew Ashton from France

I enjoyed spending an "Office Hour" on this topic with Chris and several attendees. While explaining the "pure PL/SQL" solution, I was thinking to myself "why is the looping logic so complicated???". Afterwards, I realized that plain old FOR loops could do the job just as well!

Anyway, here is a solution that puts the function in a WITH clause (as Iudith Wentzel did previously). The function is easier to understand. I do apologize for inflicting some unnecessary complications.
with function gestalt4(l in varchar2, r in varchar2) return number is

  function lcs (l in varchar2, r in varchar2) return number is
    l_start pls_integer;
    ll varchar2(128);
    rl varchar2(128);
    lr varchar2(128);
    rr varchar2(128);
  begin
    if l is null or r is null then
      return 0;
    end if;
    for lr_length in reverse 1..least(length(l),length(r)) loop
      for r_start in 1..length(r) - lr_length + 1 loop
        l_start := instr(l,substr(r,r_start,lr_length));
        if l_start > 0 then
          dbms_output.put_line(substr(r,r_start,lr_length));
          ll := substr(l, 1, l_start - 1);
          rl := substr(r, 1, r_start - 1);
          lr := substr(l, l_start + lr_length);
          rr := substr(r, r_start + lr_length);
          return lr_length + lcs(ll, rl) + lcs(lr, rr);
        end if;
      end loop;
    end loop;
    return 0;
  end lcs;

begin
  return 2 * lcs(l, r) / ( nvl(length(l),0) + nvl(length(r),0) );
end gestalt4;

select gestalt4('gestalt practice', 'gestalt pattern matching') diff from dual;
/

      DIFF
----------
0.6       

gestalt p
a
t
e

More to Explore

SQL

The Oracle documentation contains a complete SQL reference.