Hi! This is (should be...) a trivial question for those who are familiar with Regular Expressions, I guess (and hope). I used them almost 25 years ago, and I remember I was comfortable with them at the time. Weird enough, no matter how hard I am struggling, I seem now unable to figure out a "simple" substitution for a couple of REGEXP_REPLACE situations.
1) A list of codes, separated by ', ' (comma + space) with integer numbers. I'd like to obtain only unique values, each one listed only once (list coming from a LISTAGG):
-- E.g. '1, 5, 5, 7, 7, 7, 10, 11, 11, 11, 15, 15, 16' => should get '1, 5, 7, 10, 11, 15, 16'
2) A list of strings, separated by ', ' (comma + space). I'd like to obtain only unique values, each one listed only once (list coming from a LISTAGG):
-- E.g. 'TOKEN ONE, TOKEN ONE, TOKEN ONE, TOKEN ONE, EXAMPLES, SECOND, SECOND, ANOTHER ONE, ANOTHER ONE, ANOTHER ONE, APPLES, APPLES, APPLES, APPLES, APPLES, BANANAS & PEERS' => would love to get 'TOKEN ONE, EXAMPLES, SECOND, ANOTHER ONE, APPLES, BANANAS & PEERS'
Both lists of values are contained in two fields of a table, and as I said, they've been created using a LISTAGG function while aggregating rows from other tables - and unfortunately LISTAGG doesn't remove duplicates (BTW, is it there a way to do that with LISTAGG?). Would love to have a REGEXP_REPLACE able to get the wanted result in a SQL expression.
Can anyone help? THANK YOU SO MUCH IN ADVANCE!!!
Claudio de Biasio
The regular expression:
([^,]+)(,\1)+
Will find duplicates. You can then replace them with the backreference to \1
This works on numbers and words:
with rws as (
select mod(rownum,3) x from dual connect by level <= 10
)
select listagg(x, ',') within group (order by x) full,
regexp_replace( listagg(x, ',') within group (order by x), '([^,]+)(,\1)+', '\1') de_duped
from rws;
with rws as (
select to_char(to_date(mod(rownum,3)+1, 'j'), 'jsp') x from dual connect by level <= 10
)
select listagg(x, ',') within group (order by x) full,
regexp_replace( listagg(x, ',') within group (order by x), '([^,]+)(,\1)+', '\1') de_duped
from rws;
Note you need to watch for spacing. The first string has no leading space. So it doesn't match the next entry if there's a duplicate. Because "TOKEN ONE" <> " TOKEN ONE":
set define off
select regexp_replace (
'TOKEN ONE, TOKEN ONE, TOKEN ONE, TOKEN ONE, EXAMPLES, SECOND, SECOND, ANOTHER ONE, ANOTHER ONE, ANOTHER ONE, APPLES, APPLES, APPLES, APPLES, APPLES, BANANAS & PEERS',
'([^,]+)(,\1)+', '\1'
) de_dup
from dual;
DE_DUP
TOKEN ONE, TOKEN ONE, EXAMPLES, SECOND, ANOTHER ONE, APPLES, BANANAS & PEERS
Trimming the spaces could lead to unexpected results:
set define off
select regexp_replace(
replace (
'TOKEN ONE, TOKEN ONE, TOKEN ONE, TOKEN ONE, EXAMPLES, SECOND, SECOND, ANOTHER ONE, ANOTHER ONE, ANOTHER ONE, APPLES, APPLES, APPLES, APPLES, APPLES, BANANAS & PEERS',
', ', ','
),
'([^,]+)(,\1)+', '\1'
) de_dup
from dual;
DE_DUP
TOKEN ONE,EXAMPLESECOND,SECOND,ANOTHER ONE,APPLES,BANANAS & PEERS
So you're better off adding a space at the start:
set define off
select regexp_replace (
' TOKEN ONE, TOKEN ONE, TOKEN ONE, TOKEN ONE, EXAMPLES, SECOND, SECOND, ANOTHER ONE, ANOTHER ONE, ANOTHER ONE, APPLES, APPLES, APPLES, APPLES, APPLES, BANANAS & PEERS',
'([^,]+)(,\1)+', '\1'
) de_dup
from dual;
DE_DUP
TOKEN ONE, EXAMPLES, SECOND, ANOTHER ONE, APPLES, BANANAS & PEERS