Skip to Main Content
  • Questions
  • Distinct using just the common part of records

Breadcrumb

Question and Answer

Chris Saxon

Thanks for the question, Marcelo.

Asked: October 26, 2015 - 10:01 pm UTC

Last updated: October 28, 2015 - 6:26 pm UTC

Version: 11.2.0.4

Viewed 1000+ times

You Asked

I am looking for a generic way to return distinct values based on part of a column: Say for example, my result set is:

create table disks1 as
select 'emcpower0b' path from dual
union
select 'emcpower10b' from dual
union
select 'emcpower11b' from dual
union
select 'emcpower12b' from dual
/

select * from disks1 order by 1
/

PATH
-----------
emcpower0b
emcpower10b
emcpower11b
emcpower12b

select distinct substr(path, 1, 8) from disks1;

SUBSTR(P
--------
emcpower

What does every record have in common? The first 8 letters: emcpower is the common string in all records, but I would not like to use distinct substr( path, 1, 8 ), as in other cases, instead of 8, it could have any number of characters. I thought about removing the numbers, so it would give me a unique string, but that is not totally generic (think about sda, sdb, sdc, where there are no number and only two letters are common).

create table disks2 as
select 'sda' path from dual
union
select 'sdb' from dual
union
select 'sdc' from dual
/

select * from disks2
/

DIS
---
sda
sdb
sdc
|| In this case, the two initial letters are common to all records.

select distinct substr(path, 1, 2) from disks2;

SU
--
sd

What does make a string/column "distinct"? The n-initial common characters for all records.

Any ideas?

Thank you,
Marcelo

and Chris said...

Thanks for providing a complete test case and scripts.

Here's an algorithm to get you started:

- Generate all the substrings starting from the first character (e.g. e, em, emc, ...)
- For each of these check whether there's another row with a different path, that's the same up to this point
- Find the maximum substring with a match from the previous step for each path
- Return the distinct set of these max substrings.

This returns the longest string where there is at least one other row with the same starting string.

emcpower1 is the longest string with a corresponding match. This algorithm includes emcpower and emcpower1. I'm not clear how you decide that it should be just emcpower and not emcpower1.

It's also not clear what you want when there's multiple strings which share just a common starting letter. e.g. if there's also "samba", do you want:

- S, because that's the longest common?
- SD, because you want there needs to be at least two common characters?
- S & SD?

This returns s & sd. You'll need to adjust the code below based on how you answer this and how you want to handle the emcpower duplication.

create table disks1 as
  select 'emcpower0b' p from dual
  union
  select 'emcpower10b' from dual 
  union
  select 'emcpower11b' from dual
  union
  select 'emcpower12b' from dual
  union 
  select 'sda' from dual
  union
  select 'sdb' from dual
  union
  select 'sdc' from dual
  union
  select 'samba' from dual ;
  
with rws as (
  -- generate rows up to the length of the longest string
  select rownum r from dual
  connect by level <= (select max(length(p)) from disks1)
)
select distinct s 
from (
  select p, max(subp) s -- get the longest common string per path
  from (
    select p, 
           -- create all the starting strings 
           substr(p, 1, rws.r) subp
    from   disks1 d1
    -- cartesian product to generate all substrings
    cross join rws
    where  r < length(p)
    and    exists (
      -- verify there's another row which has the 
      -- same N starting characters
      select * from disks1 d2
      where  d1.p <> d2.p
      and    substr(d1.p, 1, rws.r) = substr(d2.p, 1, rws.r)
    )
  ) 
  group by p
)
order  by 1;

S         
-----------
emcpower   
emcpower1  
s          
sd    



Rating

  (1 rating)

Is this answer out of date? If it is, please let us know via a Comment

Comments

WOW

Marcelo, October 28, 2015 - 2:31 pm UTC

Amazing! Exactly what I need.

I will apply this query to ASM disk paths, and in my environment it is supposed to have just one common string - if there are more than one string (like in samba and sd), returning both, like you did, will show me that something is wrong.

AskTom is an invaluable repository. Thank you for keeping it up and running and thank you for the solution!!
Chris Saxon
October 28, 2015 - 6:26 pm UTC

Thanks for the feedback