Skip to Main Content
  • Questions
  • Finding matching string patters in one column in another column

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, Oliver.

Asked: August 09, 2013 - 7:26 pm UTC

Last updated: August 13, 2013 - 7:30 pm UTC

Version: 11.2.0.3

Viewed 10K+ times! This question is

You Asked

create table t (a varchar2(25), b varchar2(25))
/

insert into t values ('abc-doej-123', 'doe jane') ;
insert into t values ('abc-doej-123', 'doe mary') ;
insert into t values ('xyz-smithp', 'smith john') ;
insert into t values ('xyz-smithp', 'smith paul') ;
commit ;

My goal is to select out of the table the record whose column "b" value best matches the value of column "a" for each like value of column "a".

For example, if I strip out the space in column "b", 'doe jane' becomes 'doejane' and 'doe mary' becomes 'doemary'.

'doejane' has four consecutive characters that can be found in 'abc-doej-123' and 'doemary' has 3 consecutive characters that can be found in 'abc-doej-123',
This would make 'doe jane' the better match. Similarly 'smith paul' would be a better match for 'xyz-smithp' than 'smith john'.

I'm simply stuck coming up with a SQL Solution for this and would greatly appreciate your input.

Thanks for considering the problem!

and Tom said...

ops$tkyte%ORA11GR2> select a, max(b) keep (dense_rank first order by r2 DESC)
  2  from (
  3  select a, b, r2
  4    from (
  5  select t.a,
  6         t.b,
  7         to_number(substr(column_value,4) ) r2
  8    from (select a,
  9                 replace(b,' ','') b,
 10                 length(replace(b,' ','')) len
 11            from t) t,
 12          TABLE( cast( multiset(
 13            select to_char( ceil(level/len-.1) , 'fm000' ) ||
 14                            to_char( decode( mod(level,len), 0, len, mod(level,len) ), 'fm000' )
 15              from dual
 16             where ceil(level/len-.1) + decode( mod(level,len), 0, len, mod(level,len) ) <= len+1
 17           connect by level <= len*len
 18                               )
 19                   as sys.odciVarchar2List )
 20               )
 21   where instr( a, substr( b, to_number(substr(column_value,1,3)), to_number(substr(column_value,4)) ) ) > 0
 22   )
 23  )
 24  group by a
 25  /

A                         MAX(B)KEEP(DENSE_RANKFIRS
------------------------- -------------------------
abc-doej-123              doejane
xyz-smithp                smithpaul



is one approach...

on lines 12-17, we generate a set of number pairs based on the length of column b (after replacing spaces with nothing). Those number pairs for a string of length 3 would be:

x,y
1,1
1,2
1,3
2,1
2,2
2,3
3,1
3,2
3,3

for example. Of those, we only keep the pairs such that x+y <= lenght+1 - so we keep:

1,1
1,2
1,3
2,1
2,2
3,1

we use those at the next level of the view to substr all of the parts of B we can - starting from the 1 character for 1 character, then 1 for 2, 1 for 3, 2 for 1 and so on....

We only keep rows such that the substr is IN the first string A.

Once we have that - we have all of the matches - we just need to keep the max(b) after grouping by A and sorting by R2 desc (r2 is the length of the substr)...


hope that makes sense :)

and I cannot wait to see what others come up with!

Rating

  (8 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

another way

A reader, August 13, 2013 - 2:05 pm UTC

with recursive clause...


with rec(a,b,lvl,l, flg) as 
(
 select a, replace(b,' ',''), 1, length(replace(b,' ','')), instr(a,left(b,1),1)
 from t 

 union all

 select a, b, lvl+1, l, instr(a,substr(b,1,lvl+1),1)
 from rec
 where instr(a,left(b,lvl+1),1)>0 and lvl+1<l 
 
)
select * 
from
(
 select a, b, rank()over(partition by a  order by lvl desc) as rnk
 from
 (
  select * 
  from rec
 )x
)x 
where rnk=1


Tom Kyte
August 13, 2013 - 7:22 pm UTC

what is left?

Many Thanks

Oliver Ruecker, August 13, 2013 - 3:51 pm UTC

Thank you very much for your answers! Both responses are extremely helpful. For the recursive query, I simply needed to replace the LEFT functions with SUBSTR and it worked like a charm.
Tom Kyte
August 13, 2013 - 7:30 pm UTC

really?

ops$tkyte%ORA11GR2> with rec(a,b,lvl,l, flg) as
  2  (
  3      select a, replace(b,' ',''), 1, length(replace(b,' ','')),
  4  instr(a,substr(b,1),1)
  5      from t
  6      union all
  7      select a, b, lvl+1, l, instr(a,substr(b,1,lvl+1),1)
  8      from rec
  9      where instr(a,substr(b,lvl+1),1)>0 and lvl+1<l
 10  )
 11  select *
 12  from
 13  (
 14      select a, b, rank()over(partition by a  order by lvl desc) as rnk
 15      from
 16      (
 17          select *
 18          from rec
 19      )x
 20  )x
 21  where rnk=1
 22  /

A                         B                                RNK
------------------------- ------------------------- ----------
abc-doej-123              doejane                            1
abc-doej-123              doemary                            1
xyz-smithp                smithjohn                          1
xyz-smithp                smithpaul                          1




the recursive one did not seem to work - no clue what "left" was supposed to be - and substr doesn't cut it.

I don't follow the logic. The first "recurse" would give us:


ops$tkyte%ORA11GR2> select a, replace(b,' ','') b, 1 lvl, length(replace(b,' ','')) l, instr(a,substr(b,1),1) flg
  2    from t
  3  /

A                         B                                LVL          L        FLG
------------------------- ------------------------- ---------- ---------- ----------
abc-doej-123              doejane                            1          7          0
abc-doej-123              doemary                            1          7          0
xyz-smithp                smithjohn                          1          9          0
xyz-smithp                smithpaul                          1          9          0



and the "second" recurse would give us:

ops$tkyte%ORA11GR2>     select a, b, lvl+1, l, instr(a,substr(b,1,lvl+1),1)
  2      from
  3          (
  4  select a, replace(b,' ','') b, 1 lvl, length(replace(b,' ','')) l, instr(a,substr(b,1),1) flg
  5    from t
  6    ) rec
  7      where instr(a,substr(b,lvl+1),1)>0 and lvl+1<l
  8  /

no rows selected




so we get - nothing?

above user recursive query

A reader, August 13, 2013 - 11:58 pm UTC

sorry does not have oracle on hand, was run on db2 :)


SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
PL/SQL Release 11.2.0.1.0 - Production
CORE    11.2.0.1.0      Production
TNS for 32-bit Windows: Version 11.2.0.1.0 - Production
NLSRTL Version 11.2.0.1.0 - Production

SQL> ed
Wrote file afiedt.buf

  1  with rec(a,b,lvl,l, flg) as
  2  (
  3      select a, replace(b,' ',''), 1, length(replace(b,' ','')), instr(a,substr(b,1,1),1)
  4      from t
  5      union all
  6      select a, b, lvl+1, l, instr(a,substr(b,1,lvl+1),1)
  7      from rec
  8      where instr(a,substr(b,1, lvl+1),1)>0 and lvl+1<l
  9  )
 10  select *
 11  from
 12  (
 13      select a, b, rank()over(partition by a  order by lvl desc) as rnk
 14      from
 15      (
 16          select *
 17          from rec
 18      )x
 19  )x
 20* where rnk=1
 21  /

A                         B                                RNK
------------------------- ------------------------- ----------
abc-doej-123              doejane                            1
xyz-smithp                smithpaul                          1

SQL>


Excellent!

Manjunath Hegde, August 14, 2013 - 11:52 am UTC

To A Reader:
Your idea of comparing all the substrings of b with col a and ranking is nice. I dont have 11g or DB2 in hand. my doubt is will it work for value 'doemdoej jane' in column b? I am asking because you are always starting from charactor 1 while taking substring. I have tested toms method and it worked for above value.

Another recursive solution

Stew Ashton, August 14, 2013 - 3:39 pm UTC


I added a bit more data to test other conditions.

Recursion here limits the number of iterations. I never "back up" when starting to test a longer string, and I stop testing as soon as the string length is too long.
drop table t;
create table t (a varchar2(25), b varchar2(25));

insert into t values ('abc-doej-123', 'doemdoe jane') ;
insert into t values ('abc-doej-123', 'doe mary') ;
insert into t values ('xyz-smithp', 'smith john') ;
insert into t values ('xyz-smithp', 'smith paul') ;
insert into t values ('abc-smithp', '1234smithp') ;
insert into t values ('abc-smithp', 'smithp c-smithp') ;
commit ; 

with tstart as (
  select a, replace(b,' ','') b, length(replace(b,' ','')) blen
  from t
), twith (a, b, blen, beg, len, rsult) as (
  select a, b, blen, 1 beg, 2 len, instr(a,substr(b,1,2)) rsult
  from tstart
  union all
  select o.a, o.b, o.blen,
    case when o.rsult > 0 then o.beg
    else o.beg+1 end beg,
    case when o.rsult > 0 then o.len+1
    else o.len end len,
    instr(o.a,substr(o.b,
      (case when o.rsult > 0 then o.beg else o.beg+1 end),
      (case when o.rsult > 0 then o.len+1 else o.len end)
    )) rsult
    from twith o join tstart n
    on (o.a, o.b) = ((n.a, n.b)) and o.beg + o.len <= o.blen
)
select a, max(b) keep (dense_rank first order by len desc) b
from twith
where rsult > 0
group by a;

A                         B                       
------------------------- -------------------------
abc-doej-123              doemdoejane               
abc-smithp                smithpc-smithp            
xyz-smithp                smithpaul

On second thought: fewer iterations

Stew Ashton, August 14, 2013 - 9:31 pm UTC


Here's a further experiment that seems to work: I keep the longest found length as I work my way through the different "b" columns for the same "a".
WITH tstart AS (
  SELECT A, REPLACE(b,' ','') b, LENGTH(REPLACE(b,' ','')) blen,
  row_number() OVER(PARTITION BY A ORDER BY LENGTH(REPLACE(b,' ','')) desc) rn
  FROM t
), twith (A, rn, b, blen, beg, len, rsult) AS (
  select a, rn, b, blen, 1 beg, 2 len, instr(a,substr(b,1,2)) rsult
  FROM tstart
  where rn = 1
  UNION ALL
  SELECT o.A, n.rn, n.b, n.blen,
    CASE when o.rn < n.rn then 1
      when o.rsult > 0 then o.beg
      else o.beg+1 end beg,
    case when o.rsult > 0 then o.len+1
    else o.len end len,
    instr(n.A,substr(n.b,
      (CASE when o.rn < n.rn then 1 when o.rsult > 0 then o.beg else o.beg+1 end),
      (case when o.rsult > 0 then o.len+1 else o.len end)
    )) rsult
    FROM twith o JOIN tstart n
    ON o.A = n.A AND (
      (o.rn = n.rn AND o.beg + o.len <= o.blen)
      OR (n.rn = o.rn + 1 AND o.beg + o.len > o.blen)
    )
)
select a, max(b) keep (dense_rank first order by len desc) b
from twith
WHERE rsult > 0
GROUP BY A
ORDER BY A;

A                         B                       
------------------------- -------------------------
abc-doej-123              doemdoejane               
abc-smithp                smithpc-smithp            
xyz-smithp                smithpaul

Nice

Manju, August 15, 2013 - 5:49 pm UTC

Anything is possible in SQL!!!

Oracle supplied package could help

Alex, August 28, 2013 - 12:56 pm UTC

Also the function UTL_MATCH.JARO_WINKLER_SIMILARITY might be helpful in this case. It calculates the measure of agreement between string-1 and string-2, returning a value between 0 (no match) and 100 (perfect match).

SELECT a, b, UTL_MATCH.JARO_WINKLER_SIMILARITY(a, b) similarity
from t
order by similarity desc;

A B SI
--------------- --------------- --
xyz-smithp smith paul 73
xyz-smithp smith john 66
abc-doej-123 doe jane 54
abc-doej-123 doe mary 44

More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.