Skip to Main Content

Breadcrumb

Question and Answer

Connor McDonald

Thanks for the question, Claudio.

Asked: October 21, 2016 - 8:55 am UTC

Last updated: February 07, 2020 - 9:39 am UTC

Version: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production

Viewed 10K+ times! This question is

You Asked

Hi! This is (should be...) a trivial question for those who are familiar with Regular Expressions, I guess (and hope). I used them almost 25 years ago, and I remember I was comfortable with them at the time. Weird enough, no matter how hard I am struggling, I seem now unable to figure out a "simple" substitution for a couple of REGEXP_REPLACE situations.

1) A list of codes, separated by ', ' (comma + space) with integer numbers. I'd like to obtain only unique values, each one listed only once (list coming from a LISTAGG):

-- E.g. '1, 5, 5, 7, 7, 7, 10, 11, 11, 11, 15, 15, 16' => should get '1, 5, 7, 10, 11, 15, 16'

2) A list of strings, separated by ', ' (comma + space). I'd like to obtain only unique values, each one listed only once (list coming from a LISTAGG):

-- E.g. 'TOKEN ONE, TOKEN ONE, TOKEN ONE, TOKEN ONE, EXAMPLES, SECOND, SECOND, ANOTHER ONE, ANOTHER ONE, ANOTHER ONE, APPLES, APPLES, APPLES, APPLES, APPLES, BANANAS & PEERS' => would love to get 'TOKEN ONE, EXAMPLES, SECOND, ANOTHER ONE, APPLES, BANANAS & PEERS'

Both lists of values are contained in two fields of a table, and as I said, they've been created using a LISTAGG function while aggregating rows from other tables - and unfortunately LISTAGG doesn't remove duplicates (BTW, is it there a way to do that with LISTAGG?). Would love to have a REGEXP_REPLACE able to get the wanted result in a SQL expression.


Can anyone help? THANK YOU SO MUCH IN ADVANCE!!!

Claudio de Biasio

and Chris said...

The regular expression:

([^,]+)(,\1)+


Will find duplicates. You can then replace them with the backreference to \1

This works on numbers and words:

with rws as (
  select mod(rownum,3) x from dual connect by level <= 10
)
  select listagg(x, ',') within group (order by x) full,
         regexp_replace( listagg(x, ',') within group (order by x), '([^,]+)(,\1)+', '\1') de_duped
  from rws;
  
with rws as (
  select to_char(to_date(mod(rownum,3)+1, 'j'), 'jsp') x from dual connect by level <= 10
)
  select listagg(x, ',') within group (order by x) full,
         regexp_replace( listagg(x, ',') within group (order by x), '([^,]+)(,\1)+', '\1') de_duped
  from rws;


Note you need to watch for spacing. The first string has no leading space. So it doesn't match the next entry if there's a duplicate. Because "TOKEN ONE" <> " TOKEN ONE":

set define off  
select regexp_replace (
    'TOKEN ONE, TOKEN ONE, TOKEN ONE, TOKEN ONE, EXAMPLES, SECOND, SECOND, ANOTHER ONE, ANOTHER ONE, ANOTHER ONE, APPLES, APPLES, APPLES, APPLES, APPLES, BANANAS & PEERS',
    '([^,]+)(,\1)+', '\1'
 ) de_dup
from dual;

DE_DUP                                                                        
TOKEN ONE, TOKEN ONE, EXAMPLES, SECOND, ANOTHER ONE, APPLES, BANANAS & PEERS  


Trimming the spaces could lead to unexpected results:

set define off
select regexp_replace(
  replace (
    'TOKEN ONE, TOKEN ONE, TOKEN ONE, TOKEN ONE, EXAMPLES, SECOND, SECOND, ANOTHER ONE, ANOTHER ONE, ANOTHER ONE, APPLES, APPLES, APPLES, APPLES, APPLES, BANANAS & PEERS', 
    ', ', ','
  ),
  '([^,]+)(,\1)+', '\1'
) de_dup
from dual;

DE_DUP                                                             
TOKEN ONE,EXAMPLESECOND,SECOND,ANOTHER ONE,APPLES,BANANAS & PEERS


So you're better off adding a space at the start:

set define off  
select regexp_replace (
    ' TOKEN ONE, TOKEN ONE, TOKEN ONE, TOKEN ONE, EXAMPLES, SECOND, SECOND, ANOTHER ONE, ANOTHER ONE, ANOTHER ONE, APPLES, APPLES, APPLES, APPLES, APPLES, BANANAS & PEERS',
    '([^,]+)(,\1)+', '\1'
 ) de_dup
from dual;

DE_DUP                                                              
 TOKEN ONE, EXAMPLES, SECOND, ANOTHER ONE, APPLES, BANANAS & PEERS 

Rating

  (16 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

missing

A reader, October 21, 2016 - 10:04 am UTC

select regexp_replace( 'three0, three, two, two1, two2, two3, three2, three, three, one1, one2, one1', '([^,]+)(,\1)+', '\1') from dual;


three0, three, two1, two2, two3, three2, three, one1, one2, one1

"two" is missing !!
Chris Saxon
October 21, 2016 - 10:25 am UTC

Good catch.

If you know items are duplicated at most once, you could do:

select regexp_replace( ' one, one, one1, one1', '([^,]+)(,\1(,|$))+', '\1,') r
from dual; 

R            
 one, one1,


But this doesn't work with three or more matches...

select regexp_replace( ' one, one, one, one1, one1', '([^,]+)(,\1(,|$))+', '\1,') r
from dual; 

R                 
 one, one, one1,


Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski

sorry but look at " three one1"

A reader, October 21, 2016 - 11:16 am UTC

select regexp_replace( 'three0, three, two, two1, two2, two3, three2, three, three, one1, one2, one1', '([^,]+)(,\1(,|$))+', '\1') from dual;


three0, three, two, two1, two2, two3, three2, three one1, one2, one1
Chris Saxon
October 21, 2016 - 3:56 pm UTC

Hmm. How about:
([^,]+)(,\1)+(,|$)


Then reference both \1 and \3 in the replacement:

SQL> select regexp_replace( 'three0, three, two, two, two2, two3, three, three, three, one1, one1', '([^,]+)(,\1)+(,|$)', '\1\3') from dual;

REGEXP_REPLACE('THREE0,THREE,TWO,TWO,TWO2,T
-------------------------------------------
three0, three, two, two2, two3, three, one1

Perfect match

Claudio Augusto de Biasio, October 24, 2016 - 8:10 am UTC

Hi Chris,

the '([^,]+)(,\1)+(,|$)', '\1\3' RegExp is definitely working perfectly. The previous version failed when strings like '1, 14, 14' were found. This one works like a charm.

It seems it was not so trivial, I'm a little bit .. 'heartened'!

Thank you again for this one! :-)

Claudio de Biasio
Chris Saxon
October 24, 2016 - 10:28 am UTC

Thanks, glad we got there eventually :)

Not working with on some case

Nimish Rastogi, January 30, 2019 - 8:32 am UTC

</>
select regexp_replace (

'a,ab,b,bc',

'([^,]+)(,\1)+(,|$)', '\1\3'

)

from dual;
</>
Chris Saxon
February 04, 2019 - 1:10 pm UTC

How about:

select regexp_replace (
'a,ab,b,bc',
'((^|,)[^,]+)(,\1)+(,|$)', '\1\3'
)
from dual;

Sorry...

Claudio A. de Biasio, February 04, 2019 - 5:34 pm UTC

Hi Chris, thank you for your attempts of improving the answer to the question I posted some time ago - actually I'm always interested in keeping up with this kind of matters until they're completely solved, but...

... after 7 years, I closed the consulting experience with the Customer where I first had that problem, and now I'm working with another Corporate customer using MS SQL Server technology. Also, I've no longer any access to an Oracle Server, so I cannot even test your solution. I'm very sorry about this, I wish I could have been able to see if it fixed my issue...

ANYWAY. I really think the point is REALLY important, so I hope that someone else will come up and test these regular expressions with their data - maybe providing an even better strategy.

Thank you for all your support, it has been very precious for me!!!

Claudio de Biasio
Chris Saxon
February 05, 2019 - 12:59 pm UTC

Thanks for getting back in touch Claudio!

FYI, if you ever want to play around with Oracle Database without the hassle of installing it in some way, you can use Live SQL. This is a free, browser-based, interactive coding environment:

https://livesql.oracle.com/

Not everything works (e.g. external tables), but still a handy sandbox for simple stuff.

some distinguishing features

Racer I., February 05, 2019 - 2:13 pm UTC

Hi,

https://community.oracle.com/ideas/12533

There's also some discussion on which regexp might do it, but if the statement about LISTAGG-DISTINCT coming with 19c is true that would make the point moot.

regards,
Chris Saxon
February 05, 2019 - 5:05 pm UTC

Listagg distinct is here with 19c, you can check it out on Live SQL :)

https://livesql.oracle.com/apex/livesql/file/content_HT1O85E4BHSBWN93G1B3M8SI2.html

Wow!

Claudio Augusto de Biasio, February 05, 2019 - 10:27 pm UTC

Hi Chris, thank you so much! Didn't really know anything about Online ORACLE sandbox - it has been a real refreshing update!!! Sure enough I will use it from time to time to test something.

Well, if you're lucky enough to get hands on a rel 19c, actually the whole point is totaly useless - this new version of LISTAGG DISTINCT makes exactly what it had to do since immediately! :-) Anyway, I bet there're a lot (most) guys out there who are not interacting with the very latest version on the market, so our post could still be useful for quite some time, I guess...

Thank you for this update - and for the revised, final REGEXP, which works perfectly!
Connor McDonald
February 06, 2019 - 12:34 am UTC

glad we could help

Failing in my case

RajatGupta, November 26, 2019 - 12:28 pm UTC

string
a,b,b,bb,b,b,b
EXPECTED OUTPUT: a,b,bb,b (considering the input data)

REGEXP:
'([^,]+)(,\1)+','\1')

OUTPUT: a,bb

REGEXP:
'((^|,)[^,]+)(,\1)+(,|$)','\1\3')

OUTPUT: a,b,b,bb,b,b,b

REGEXP:
'([^,]+)(,\1)+(,|$)', '\1\3')

OUTPUT: a,b,bb
Chris Saxon
November 26, 2019 - 4:03 pm UTC

It may be easier to use SQL to:

- Split the string into rows
- Find consecutive rows with the same value
- Combine them back into one row

e.g.:

with vals as (
  select 'a,b,b,bb,b,b,b' str from dual
), rws as (
  select str, rownum x 
  from   vals, lateral (
    select * from dual
    connect by level <= length ( str ) - length ( replace ( str, ',' ) )
  ) 
), splits as (
  select x, str, regexp_substr ( str, '[^,]+', 1, x ) sub
  from   rws
)
  select listagg ( sub, ',' ) 
           within group ( order by x )
  from splits match_recognize (
    order by x
    measures
      first ( x ) x,
      first ( sub ) sub
    pattern ( init same* )
    define
      same as sub = prev ( sub )
  );
  
LISTAGG(SUB,',')WITHINGROUP(ORDERBYX)   
a,b,bb,b    

Re: Failing in my case

Stew Ashton, November 27, 2019 - 9:38 am UTC

It appears that the regexp pattern accepts back references, so it can tell when the same value is repeated in a string.
with vals as (
  select 'a,b,b,bb,b,b,b' str from dual
)
select rtrim(
  regexp_replace(str||',', '([^,]+,)(\1)*', '\1'),
  ','
) deduped_str
from vals;

DEDUPED_
--------
a,b,bb,b

Best regards,
Stew Ashton
Chris Saxon
November 27, 2019 - 10:45 am UTC

Nice stuff Stew!

That pesky regexp

AndyP, November 28, 2019 - 11:12 am UTC

It's my observation over many years that whenever anyone, even someone as illustrious as Chris, tries to use regexp for anything other than really straightforward cases, it fails

I would posit that the syntax is just too terse and obscure to work as intended without undergoing multiple unsuccessful attempts first

Hence the thing about two problems

In this case, and without using any post-version-11 features (I know, who still uses Oracle 11 right?):

col newlist for a70

with data as
(
select '1, 5, 5, 7, 7, 7, 10, 11, 11, 11, 15, 15, 16' numlist
      ,'TOKEN ONE, TOKEN ONE, TOKEN ONE, TOKEN ONE, EXAMPLES, SECOND, SECOND, ANOTHER ONE, ANOTHER ONE, ANOTHER ONE, APPLES, APPLES, APPLES, APPLES, APPLES, BANANAS n PEARS' strlist
  from dual
)
,numvalues as
(
select numlist
      ,instr(numlist,',',1,level) pos
  from data
connect by level <= length(numlist)-nvl(length(replace(numlist,',','')),0)+1
)
,strvalues as
(
select strlist
      ,instr(strlist,',',1,level) pos
  from data
connect by level <= length(strlist)-nvl(length(replace(strlist,',','')),0)+1
)
,valueset as
(
select 'numlist' col
      ,trim(substr(numlist,pos+1,lead(pos,1,4000) over(order by pos)-pos-1)) subs
  from numvalues
union all
select 'strlist' col
      ,trim(substr(strlist,pos+1,lead(pos,1,4000) over(order by pos)-pos-1)) subs
  from strvalues
)
,allvalues as
(
select col,subs
  from valueset
 group by col,subs
)
select col,listagg (subs,',') within group (order by subs) newlist
  from allvalues
 group by col
/

COL     NEWLIST
------- ----------------------------------------------------------------------
numlist 1,10,11,15,16,5,7
strlist ANOTHER ONE,APPLES,BANANAS n PEARS,EXAMPLES,SECOND,TOKEN ONE



I expect that could be condensed but on the other hand maybe the step-by-step approach is helpful


Chris Saxon
November 28, 2019 - 11:29 am UTC

I agree - regexes are notoriously easy to get wrong!

Thanks for adding the "pure SQL" solutions :)

Duplicates still not removed

arun, February 05, 2020 - 2:55 pm UTC

select regexp_replace('1739, 1739, 1745, 1745, 1748, 1748, 1755, 1755, 1758, 1758, 1758, 1758, 1758, 1758, 1761, 1761, 1764, 1764, 1765, 1765, 1766, 1766, 1767, 1767, 1768, 1768, 1771, 1771, 1772, 1772, 1774, 1774, 1774, 1774, 1774, 1774, 1781, 1781, 1786, 1786', '([^,]+)(,\1)+', '\1') from dual; executing the above query still returns duplicate values... not sure why... please help me with explanation on what i have done wrong
Chris Saxon
February 05, 2020 - 3:17 pm UTC

It's down to the spaces.

The first characters are (\s) for emphasis:

1739,\s1739


The regex searches for the non-comma characters. From the start this matches 1739.

It then searches for this set of characters after a comma. But the next section is:

,\s1739


1739 <> \s1739. So it doesn't find this.

Adding a leading space at the start resolves this. Or removing all whitespace before applying the regex.

Sorry, but have to ask it :)

A reader, February 06, 2020 - 6:34 am UTC


We have a requirement where we need to migrate stuff from SQL Server to Oracle. One of the SQLs uses a PARSENAME, so essentially, we need to read the following string from the back. Any suggestions?

SELECT        
'AdventureWorksPDW2012.dbo.DimCustomer' Original_text
regexp_substr('AdventureWorksPDW2012.dbo.DimCustomer', '[^.]+$',1,1), --returns DimCustomer
regexp_substr('AdventureWorksPDW2012.dbo.DimCustomer', '[^.]+$',1,1), --returns dbo
regexp_substr('AdventureWorksPDW2012.dbo.DimCustomer', '[^.]+$',1,1), --returns AdventureWorksPDW2012
FROM DUAL;

Needless to say the above SQL is incorrect the same values for all the 3 columns. Any help/advice will be much appreciated.


Thanks in advance 



Chris Saxon
February 06, 2020 - 1:26 pm UTC

The 4th parameter of regexp_substr is the occurrence. So provided you know there's always three parts you can use this to work backwards:
SELECT        
'AdventureWorksPDW2012.dbo.DimCustomer' Original_text,
regexp_substr('AdventureWorksPDW2012.dbo.DimCustomer', '[^.]+',1,3) c1, --returns DimCustomer
regexp_substr('AdventureWorksPDW2012.dbo.DimCustomer', '[^.]+',1,2) c2, --returns dbo
regexp_substr('AdventureWorksPDW2012.dbo.DimCustomer', '[^.]+',1,1) c3--returns AdventureWorksPDW2012
FROM DUAL;

ORIGINAL_TEXT                            C1             C2     C3                      
AdventureWorksPDW2012.dbo.DimCustomer    DimCustomer    dbo    AdventureWorksPDW2012 

The duplicate values are removed but some values get merged

arun, February 06, 2020 - 7:50 am UTC

<p>I executed the following query:</p>
<p>
select regexp_replace('1739,1745,1748,1755,1755,1758,1761,1764,1765,1766,1767,1768,1771,1772,1774,1781,1786','([^,]+)(,\1)+', '\1') from dual;</p>
<p>
Following is the output:
1739,1745,1748,1755,1758,1761764,1765,1766,1767,1768,1771772,1774,1781786
</p>
<p>
The highlighted values are the merged values.</p>

<p>Thanks for the previous update</p>
Chris Saxon
February 06, 2020 - 1:11 pm UTC

It's matching when consecutive pairs have the same end then start value.

e.g.:

1761,1764

Has one comma one.

The regex matches the first "one". Then a comma followed by another "one". Which is why they're merged.

As I mentioned in another follow-up, it's safer to:

- Split the string into rows
- Find consecutive rows with the same value
- Combine them back into one row

To Arun

Rajeshwaran, Jeyabal, February 06, 2020 - 2:37 pm UTC

demo@PDB1> variable x varchar2(120)
demo@PDB1> exec :x := '1739,1745,1748,1755,1755,1758,1761,1764,1765,1766,1767,1768,1771,1772,1774,1781,1786';

PL/SQL procedure successfully completed.

demo@PDB1> select regexp_replace( :x,'([^,]+,)(\1)+','\1') final_results from dual;

FINAL_RESULTS
--------------------------------------------------------------------------------------------------------------
1739,1745,1748,1755,1758,1761,1764,1765,1766,1767,1768,1771,1772,1774,1781,1786

demo@PDB1>

Chris Saxon
February 07, 2020 - 9:39 am UTC

Be careful - that won't remove duplicates at the end of the string:

variable x varchar2(120)
exec :x := '1739,1739,1786,1786';
select regexp_replace( :x,'([^,]+,)(\1)+','\1') final_results from dual;

FINAL_RESULTS    
1739,1786,1786    

Thank you Chris much apprecaited

A reader, February 07, 2020 - 6:04 am UTC

For reading the string in reverse for REGEXP_SUBSTR

remove duplicates at the end of the string:

Rajeshwaran, Jeyabal, February 07, 2020 - 12:49 pm UTC

demo@PDB1> variable x varchar2(120)
demo@PDB1> exec :x := '1739,1739,1786,1786';

PL/SQL procedure successfully completed.

demo@PDB1>
demo@PDB1> select regexp_replace( :x,'([^,]+,)(\1)+','\1') final_results from dual;

FINAL_RESULTS
------------------------------------------------------------------------------------
1739,1786,1786

demo@PDB1>


the idea behind the above approach is to take each non-delimited values and delimited character, together
and look for one or more repeated co-located occurance as such.
if so have them replaced

Just tweek the above code to have the delimited character included in the end of the input string - so that it removes the repeated co-located character even at the end of the input string.

something like this:

demo@PDB1> select  regexp_replace( new_x, '([^,]+,)(\1)+','\1') final_results
  2  from (
  3  select case when substr( :x,-1,1) <> ',' then :x||',' else :x end new_x
  4  from dual
  5      )
  6  /

FINAL_RESULTS
------------------------------------------------------------------------------
1739,1786,

demo@PDB1>