Skip to Main Content
  • Questions
  • How to remove multiple word occurance from an input string using oracle PL/SQL

Breadcrumb

Question and Answer

Connor McDonald

Thanks for the question, Arijit.

Asked: April 16, 2018 - 9:55 am UTC

Last updated: September 07, 2020 - 2:07 am UTC

Version: 10.1.2

Viewed 10K+ times! This question is

You Asked

Remove duplicate words from a address using oracle pl/sql:

There are two types of addresses will be there, below is the example

1. '3 Mayers Court 3 Mayers Court' : where total no of words in address is even and either all words/combination of words are duplicate.

2. 'Manor House Manor' or '1 Briar Cottages 1 Briar': where total no of words in address is odd and thus there is a middle word across which all words/combination of words on its left and right are duplicate.

3.'The Old Millers Cottage Wood Annex Hall The Old Millers Cottage' expected output is - 'The Old Millers Cottage Wood Annex Hall'. Here cottage is present so the duplicate can remove from either left side or else right side any one is correct.

4.'The Old Millers Cottage Farm Street Wood Annex Hall The Old' expected output is - 'The Old Millers Cottage Farm Street Wood Annex Hall'. Here cottage is present so the duplicate can remove from either left side or else right side any one is correct.

Like this way the space between a string could be 1 to 10 and there will be more examples like,

A202 A202
Flat A202 Flat
The Old The Old
The Old Millers The Old
The Old Millers The Old Millers
The Old Cottage Millers The Old
The Old Millers Cottage The Old Millers
The Old Millers Cottage Annex The Old
The Old Millers Cottage The Old Millers Cottage
The Old Millers Cottage Annex The Old Millers
The Old Cottage Wood Annex Hall The Old
The Old Millers Cottage Annex The Old Millers Cottage
The Old Millers Cottage Wood Annex The Old Millers
The Old Cottage Street Wood Annex Hall The Old
The Old Millers Cottage Annex The Old Millers Cottage Annex
The Old Millers Cottage Wood Annex The Old Millers Cottage
The Old Millers Cottage Wood Annex Hall The Old Millers
The Old Millers Cottage Street Wood Annex Hall The Old
The Old Millers Cottage Annex Hall The Old Millers Cottage Annex
The Old Millers Cottage Wood Annex Hall The Old Millers Cottage
The Old Millers Cottage Street Wood Annex Hall The Old Millers
The Old Millers Cottage Farm Street Wood Annex Hall The Old


So i have to remove all the different types of duplicate words in those above strings.
Any help would be appreciated.

I have tried the below code but unable to satisfied all the test cases. Below is my code,

CREATE OR REPLACE FUNCTION FN_ADD_CLEANUP(ADD_IN IN VARCHAR2)
RETURN VARCHAR2 IS
NO_SPACES NUMBER;
F_ADD VARCHAR2(255);
T_ADD VARCHAR2(255);
ADD_OUT VARCHAR2(255);

BEGIN

-- INITIALIZING THE PARAMETER VALUE
SELECT LTRIM(RTRIM(ADD_IN)) INTO F_ADD FROM DUAL;

-- SET NO OF SPACES
SELECT REGEXP_COUNT(F_ADD, ' ') INTO NO_SPACES FROM DUAL;

-- TRIMED STRING
SELECT TRIM(SUBSTR(F_ADD, INSTR(F_ADD, ' ', 1, NO_SPACES/2 + 1))) INTO T_ADD FROM DUAL;

-- RESULT
ADD_OUT := CASE WHEN F_ADD LIKE T_ADD||'%'
THEN SUBSTR(F_ADD, 1, LENGTH(F_ADD) - LENGTH(T_ADD))
ELSE F_ADD
END;

RETURN ADD_OUT;

END; 

and Chris said...

A little regular expression can help you:

(.*)(.*)\1


Will find:

1 any series of characters
2 followed by another series of any characters
3 followed by the string matched at step 1:

with rws as (
  select '3 Mayers Court 3 Mayers Court' str from dual union all
  select 'Manor House Manor' str from dual union all 
  select '1 Briar Cottages 1 Briar' str from dual union all
  select 'The Old Millers Cottage Wood Annex Hall The Old Millers Cottage' str from dual union all
  select 'The Old Millers Cottage Farm Street Wood Annex Hall The Old' str from dual
)
  select str, regexp_replace(str, '(.*)(.*)\1', '\1\2') from rws;

STR                                                               REGEXP_REPLACE(STR,'(.*)(.*)\1','\1\2')                
3 Mayers Court 3 Mayers Court                                     3 Mayers Court                                         
Manor House Manor                                                 Manor House                                            
1 Briar Cottages 1 Briar                                          1 Briar Cottages                                       
The Old Millers Cottage Wood Annex Hall The Old Millers Cottage   The Old Millers Cottage Wood Annex Hall                
The Old Millers Cottage Farm Street Wood Annex Hall The Old       The Old Millers Cottage Farm Street Wood Annex Hall    


But take care. This will only remove the first duplicate:

with rws as (
  select '3 Mayers Court 3 Mayers Court 3 Mayers Court' str from dual 
)
  select regexp_replace(str, '(.*)(.*)\1', '\1\2') from rws;

REGEXP_REPLACE(STR,'(.*)(.*)\1','\1\2')   
3 Mayers Court 3 Mayers Court             


And may not work on some data:

with rws as (
  select 'Big House 3 Manor House 3 Manor House' str from dual 
)
  select regexp_replace(str, '(.*)(.*)\1', '\1\2') from rws;

REGEXP_REPLACE(STR,'(.*)(.*)\1','\1\2')   
Big House 3 Manor House 3 Manor House     


Make sure you test thoroughly on your target data set.

Rating

  (9 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

With LATERAL and LISTAGG

Rajeshwaran, Jeyabal, April 19, 2018 - 1:44 pm UTC

drop table t purge;
create table t(x int,y varchar2(60) );
insert into t(x,y) values(1,'The Old Millers Cottage Wood Annex The Old Millers Cottage');
insert into t(x,y) values(2,'The Old Millers Cottage Wood Annex Hall The Old Millers');
insert into t(x,y) values(3,'The Old Millers Cottage Street Wood Annex Hall The Old');
insert into t(x,y) values(4,'Big House 3 Manor House 3 Manor House');
commit;


LATERAL and LISTAGG would do the trick in 12c and above.

Prior to that- table/cast/multiset along with stragg would help.

demo@ORA12C> select x,y, listagg(r1,' ') within group (order by x,r2) new_y
  2  from (
  3  select x,y , r1,r2,
  4      row_number() over( partition by x,r1 order by  r2) rn
  5  from t , lateral( select trim(regexp_substr( y,'\w+',1,level) ) r1, level r2
  6    from dual
  7    connect by level <= regexp_count( t.y, '\s+') ) t2
  8      )
  9  where rn = 1
 10  group by x,y    ;

         X Y                                                            NEW_Y
---------- ------------------------------------------------------------ ------------------------------------------------
         1 The Old Millers Cottage Wood Annex The Old Millers Cottage   The Old Millers Cottage Wood Annex
         2 The Old Millers Cottage Wood Annex Hall The Old Millers      The Old Millers Cottage Wood Annex Hall
         3 The Old Millers Cottage Street Wood Annex Hall The Old       The Old Millers Cottage Street Wood Annex Hall
         4 Big House 3 Manor House 3 Manor House                        Big House 3 Manor

demo@ORA12C>



Chris Saxon
April 19, 2018 - 2:57 pm UTC

Yep, that's another way.

Old School

Chuck Jolley, April 19, 2018 - 7:07 pm UTC

SQL*Plus: Release 11.2.0.1.0 Production on Thu Apr 19 14:01:29 2018

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

SQL> drop table phrases;

Table dropped.

SQL> create table phrases
  2  as
  3  select 'The Old The Old' str from dual union all
  4     select 'The Old Millers The Old' from dual union all
  5     select 'The Old Millers The Old Millers' from dual union all
  6     select 'The Old Cottage Millers The Old' from dual union all
  7     select 'The Old Millers Cottage The Old Millers' from dual union all
  8     select 'The Old Millers Cottage Annex The Old' from dual union all
  9     select 'The Old Millers Cottage The Old Millers Cottage' from dual union all
 10     select 'The Old Millers Cottage Annex The Old Millers' from dual union all
 11     select 'The Old Cottage Wood Annex Hall The Old' from dual union all
 12     select 'The Old Millers Cottage Annex The Old Millers Cottage' from dual union all
 13     select 'The Old Millers Cottage Wood Annex The Old Millers' from dual union all
 14     select 'The Old Cottage Street Wood Annex Hall The Old' from dual union all
 15     select 'The Old Millers Cottage Annex The Old Millers Cottage Annex' from dual union all
 16     select 'The Old Millers Cottage Wood Annex The Old Millers Cottage' from dual union all
 17     select 'The Old Millers Cottage Wood Annex Hall The Old Millers' from dual union all
 18     select 'The Old Millers Cottage Street Wood Annex Hall The Old' from dual union all
 19     select 'The Old Millers Cottage Annex Hall The Old Millers Cottage Annex' from dual union all
 20     select 'The Old Millers Cottage Wood Annex Hall The Old Millers Cottage' from dual union all
 21     select 'The Old Millers Cottage Street Wood Annex Hall The Old Millers' from dual union all
 22     select 'The Old Millers Cottage Farm Street Wood Annex Hall The Old' from dual union all
 23     select '3 Mayers Court 3 Mayers Court' str from dual union all
 24     select 'Manor House Manor' str from dual union all
 25     select '1 Briar Cottages 1 Briar' str from dual union all
 26     select 'The Old Millers Cottage Wood Annex Hall The Old Millers Cottage' str from dual union all
 27     select 'The Old Millers Cottage Farm Street Wood Annex Hall The Old' str from dual union all
 28     select 'Big House 3 Manor House 3 Manor House' str from dual;

Table created.

SQL> commit;

Commit complete.

SQL> create or replace function no_dup_words(ar_param varchar2)
  2  return varchar2
  3  as
  4  v_working varchar2(4096) := '';
  5  v_result varchar2(4096) := '';
  6  v_word varchar2(4096) := '';
  7  begin
  8     v_working := ar_param || ' ';
  9     while v_working is not null loop
 10        v_word := trim(substr(v_working, 1 ,instr(v_working, ' ')));
 11        v_result := v_result || ' ' || v_word;
 12        v_working := ltrim(replace(v_working, v_word, ''));
 13     end loop;
 14     return trim(v_result);
 15  end;
 16      /

Function created.

SQL> select no_dup_words(str) ndw from phrases;

NDW
----------------------------------------------------------------------------------------------------------------------------------
The Old
The Old Millers
The Old Millers
The Old Cottage Millers
The Old Millers Cottage
The Old Millers Cottage Annex
The Old Millers Cottage
The Old Millers Cottage Annex
The Old Cottage Wood Annex Hall
The Old Millers Cottage Annex
The Old Millers Cottage Wood Annex
The Old Cottage Street Wood Annex Hall
The Old Millers Cottage Annex
The Old Millers Cottage Wood Annex
The Old Millers Cottage Wood Annex Hall
The Old Millers Cottage Street Wood Annex Hall
The Old Millers Cottage Annex Hall
The Old Millers Cottage Wood Annex Hall
The Old Millers Cottage Street Wood Annex Hall
The Old Millers Cottage Farm Street Wood Annex Hall
3 Mayers Court
Manor House
1 Briar Cottages
The Old Millers Cottage Wood Annex Hall
The Old Millers Cottage Farm Street Wood Annex Hall
Big House 3 Manor

26 rows selected.

SQL>

Connor McDonald
May 01, 2018 - 1:10 am UTC

nice work :-)

New and improved.

Chuck Jolley, May 02, 2018 - 7:46 pm UTC

Hmm... it didn't handle embedded words and numbers at all :(
Here it's fixed, I hope.

SQL> create table phrases as
  2  select '123 12345 345 123 345' str from dual union all
  3  select '1 South Southhampton' str from dual union all
  4  select '    The Old     The Oldman  ' str from dual union all
  5  select 'The Old The Oldman' str from dual union all
  6  select 'The Old Millers The Old' from dual union all
  7  select 'The Old Millers The Old Millers' from dual union all
  8  select '3 Mayers Court 3 Mayers Court' str from dual union all
  9  select 'Manor House Manor' str from dual union all
 10  select '2 Briar Cottages 2 Briar' str from dual;

Table created.

SQL>
SQL> create or replace function no_dup_words(ar_param varchar2)
  2      return varchar2
  3      as
  4      v_working varchar2(4096) := '';
  5      v_result varchar2(4096) := '';
  6      v_word varchar2(4096) := '';
  7      begin
  8         v_working := ar_param || ' ';
  9         while v_working is not null loop
 10           v_working := ltrim(v_working);
 11           v_word := ' '||trim(substr(v_working, 1 ,instr(v_working, ' ')))||' ';
 12           v_result := v_result || ' ' || trim(v_word);
 13           v_working := ' '||replace(v_working, ' ', '  ');
 14           v_working := ltrim(replace(v_working, v_word, ''));
 15        end loop;
 16        return trim(v_result);
 17     end;
 18  /

Function created.

SQL>
SQL> select str, no_dup_words(str) ndw from phrases;

STR
-------------------------------
NDW
--------------------------------
123 12345 345 123 345
123 12345 345

1 South Southhampton
1 South Southhampton

    The Old     The Oldman
The Old Oldman

The Old The Oldman
The Old Oldman

The Old Millers The Old
The Old Millers

The Old Millers The Old Millers
The Old Millers

3 Mayers Court 3 Mayers Court
3 Mayers Court

Manor House Manor
Manor House

2 Briar Cottages 2 Briar
2 Briar Cottages


9 rows selected.

SQL>


Slightly cleaner

Chuck Jolley, May 02, 2018 - 7:55 pm UTC

create or replace function no_dup_words(ar_param varchar2)
    return varchar2
    as
    v_working varchar2(4096) := '';
    v_result varchar2(4096) := '';
    v_word varchar2(4096) := '';
    begin
       v_working := replace(ar_param, ' ', '  ') || ' ';
       while v_working is not null loop
         v_working := ltrim(v_working);
         v_word := ' '||trim(substr(v_working, 1 ,instr(v_working, ' ')))||' ';
         v_result := v_result || ' ' || trim(v_word);
         v_working := ' '||v_working;
         v_working := ltrim(replace(v_working, v_word, ''));
      end loop;
      return trim(v_result);
   end;

Chris Saxon
May 03, 2018 - 10:28 am UTC

Good stuff.

Though you've got to watch for punctuation:

select no_dup_words('house house;') from dual;

NO_DUP_WORDS('HOUSEHOUSE;')   
house house;   

lol

Chuck Jolley, May 03, 2018 - 3:58 pm UTC

That's a business rule question.
What do they want done with punctuation?

leave it? - where?
strip it?

Luckily for me, the original question didn't seem to involve punctuation. ;)
Chris Saxon
May 03, 2018 - 5:07 pm UTC

:)

Select

Thiago Kato Prado, May 07, 2018 - 1:43 pm UTC

with t(x, y) as (
  select 1, 'The Old Millers Cottage Wood Annex The Old Millers Cottage' from dual
  union all select 2, 'The Old Millers Cottage Wood Annex Hall The Old Millers' from dual
  union all select 3, 'The Old Millers Cottage Street Wood Annex Hall The Old' from dual
  union all select 4, 'Big House 3 Manor House 3 Manor House' from dual
)
select x, y,
    (select listagg(word, ' ') within group (order by word_lvl)
     from (select
           row_number() over(partition by x, word order by word_lvl) as rn,
           word, word_lvl
           from (select trim(regexp_substr(t.y,'\S+', 1, level)) as word,
                        level as word_lvl
                   from dual
                 connect by level <= regexp_count(y, '\s+')))
       where rn = 1)
     as new_y
from t
;

Chris Saxon
May 09, 2018 - 10:26 am UTC

Yep, that's another approach.

ok fine. punctuation.

Chuck Jolley, May 15, 2018 - 3:57 pm UTC


SQL> drop table chajol.phrases;

Table dropped.

SQL> create table chajol.phrases as
  2  select 'house, ,,, house ,,, house,' str from dual union all
  3  select 'house house; house' str from dual union all
  4  select '123 12345 - 345 123 345' str from dual union all
  5  select '1 South, Southhampton' str from dual union all
  6  select '    The Old ;.,-=`    The Oldman  ' str from dual union all
  7  select ',The Old ,The Oldman' str from dual union all
  8  select 'The Old Millers The Old' from dual union all
  9  select 'The Old Millers The Old Millers' from dual union all
 10  select '3 Mayers Court 3 Mayers Court' str from dual union all
 11  select 'Manor House Manor' str from dual union all
 12  select '2 Briar Cottages 2 Briar' str from dual;

Table created.

SQL>
SQL> create or replace function chajol.no_dup_words(ar_param varchar2, v_done varchar2 default 0)
  2      return varchar2
  3      as
  4      v_working varchar2(4096) := '';
  5      v_result varchar2(4096) := '';
  6      v_word varchar2(4096) := '';
  7      v_1char varchar2(1) := '';
  8      begin
  9         for i in 1..length(ar_param) loop
 10            v_working := substr(ar_param, i, 1) || v_working;
 11         end loop;
 12         v_working := ltrim(replace(v_working, ' ', '  ')) || ' ';
 13         while v_working is not null loop
 14           v_word := ' '||trim(substr(v_working, 1 ,instr(v_working, ' ')))||' ';
 15           v_result := v_result || ' ' || trim(v_word);
 16           v_working := ' '||trim(replace(' '||v_working, v_word, ''))||' ';
 17           v_1char := substr(trim(v_word), 1, 1);
 18           if v_1char not between '0' and '9'
 19              and v_1char not between 'A' and 'Z'
 20              and v_1char not between 'a' and 'z'
 21              and length(trim(v_word)) > 1
 22              and trim(translate(upper(trim(v_word)),translate(upper(trim(v_word)), '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ', ' '), ' ')) is not null
 23              then
 24              v_word := ' '||substr(trim(v_word), 2) || ' ';
 25              v_result := ' '||replace(v_result, ' ', '  ')||' ';
 26              v_result := replace(v_result, v_word, '');
 27              v_result := replace(trim(v_result), '  ', ' ');
 28              v_working := replace(v_working, v_word, '');
 29           end if;
 30           v_working := ltrim(v_working);
 31        end loop;
 32        if v_done = 0 then
 33           v_working := chajol.no_dup_words(v_result, 1);
 34        end if;
 35        v_working := v_result;
 36        v_result := '';
 37        for i in 1..length(v_working) loop
 38           v_result := substr(v_working, i, 1) || v_result;
 39        end loop;
 40        return trim(v_result);
 41     end;
 42  /

Function created.

SQL>
SQL> select str, chajol.no_dup_words(str) ndw from chajol.phrases;

STR
----------------------------------
NDW
----------------------------------------------------------------------------------------------------------------------------------
house, ,,, house ,,, house,
,,, house,

house house; house
house;

123 12345 - 345 123 345
12345 - 123 345

1 South, Southhampton
1 South, Southhampton

    The Old ;.,-=`    The Oldman
Old ;.,-=` The Oldman

,The Old ,The Oldman
Old ,The Oldman

The Old Millers The Old
Millers The Old

The Old Millers The Old Millers
The Old Millers

3 Mayers Court 3 Mayers Court
3 Mayers Court

Manor House Manor
House Manor

2 Briar Cottages 2 Briar
Cottages 2 Briar


11 rows selected.

SQL>

Chris Saxon
May 15, 2018 - 4:25 pm UTC

Nice work :)

typo my test data didn't catch

Chuck Jolley, May 16, 2018 - 6:32 pm UTC

Line 33 should be:
v_result := chajol.no_dup_words(v_result, 1);

ok fine. punctuation.

Juan Carlos Mamani, September 05, 2020 - 8:41 am UTC

Congrats for your Skills!!

I have a doubt,

Why I get an error in line 11. Numeric Error or value

Thanks in advance!!
Connor McDonald
September 07, 2020 - 2:07 am UTC

we need to see the full execution and error

More to Explore

SQL

The Oracle documentation contains a complete SQL reference.