I am looking for a generic way to return distinct values based on part of a column: Say for example, my result set is:
create table disks1 as
select 'emcpower0b' path from dual
union
select 'emcpower10b' from dual
union
select 'emcpower11b' from dual
union
select 'emcpower12b' from dual
/
select * from disks1 order by 1
/
PATH
-----------
emcpower0b
emcpower10b
emcpower11b
emcpower12b
select distinct substr(path, 1, 8) from disks1;
SUBSTR(P
--------
emcpower
What does every record have in common? The first 8 letters: emcpower is the common string in all records, but I would not like to use distinct substr( path, 1, 8 ), as in other cases, instead of 8, it could have any number of characters. I thought about removing the numbers, so it would give me a unique string, but that is not totally generic (think about sda, sdb, sdc, where there are no number and only two letters are common).
create table disks2 as
select 'sda' path from dual
union
select 'sdb' from dual
union
select 'sdc' from dual
/
select * from disks2
/
DIS
---
sda
sdb
sdc
|| In this case, the two initial letters are common to all records.
select distinct substr(path, 1, 2) from disks2;
SU
--
sd
What does make a string/column "distinct"? The n-initial common characters for all records.
Any ideas?
Thank you,
Marcelo
Thanks for providing a complete test case and scripts.
Here's an algorithm to get you started:
- Generate all the substrings starting from the first character (e.g. e, em, emc, ...)
- For each of these check whether there's another row with a different path, that's the same up to this point
- Find the maximum substring with a match from the previous step for each path
- Return the distinct set of these max substrings.
This returns the longest string where there is at least one other row with the same starting string.
emcpower1 is the longest string with a corresponding match. This algorithm includes emcpower and emcpower1. I'm not clear how you decide that it should be just emcpower and not emcpower1.
It's also not clear what you want when there's multiple strings which share just a common starting letter. e.g. if there's also "samba", do you want:
- S, because that's the longest common?
- SD, because you want there needs to be at least two common characters?
- S & SD?
This returns s & sd. You'll need to adjust the code below based on how you answer this and how you want to handle the emcpower duplication.
create table disks1 as
select 'emcpower0b' p from dual
union
select 'emcpower10b' from dual
union
select 'emcpower11b' from dual
union
select 'emcpower12b' from dual
union
select 'sda' from dual
union
select 'sdb' from dual
union
select 'sdc' from dual
union
select 'samba' from dual ;
with rws as (
-- generate rows up to the length of the longest string
select rownum r from dual
connect by level <= (select max(length(p)) from disks1)
)
select distinct s
from (
select p, max(subp) s -- get the longest common string per path
from (
select p,
-- create all the starting strings
substr(p, 1, rws.r) subp
from disks1 d1
-- cartesian product to generate all substrings
cross join rws
where r < length(p)
and exists (
-- verify there's another row which has the
-- same N starting characters
select * from disks1 d2
where d1.p <> d2.p
and substr(d1.p, 1, rws.r) = substr(d2.p, 1, rws.r)
)
)
group by p
)
order by 1;
S
-----------
emcpower
emcpower1
s
sd