Skip to Main Content
  • Questions
  • How to split comma seperated column of clob datatype and insert distinct rows into another table?

Breadcrumb

more

Connor and Chris don't just spend all day on AskTOM. You can also catch regular content via Connor's blog and Chris's blog. Or if video is more your thing, check out Connor's latest video and Chris's latest video from their Youtube channels. And of course, keep up to date with AskTOM via the official twitter account.

Question and Answer

Chris Saxon

Thanks for the question, swarna.

Asked: June 22, 2016 - 8:07 am UTC

Answered by: Chris Saxon - Last updated: June 24, 2016 - 2:52 pm UTC

Category: Database - Version: 12.1.0.2.0

Viewed 10K+ times! This question is

Whilst you are here, check out some content from the AskTom team: Tightening up your data model

You Asked

Hi,

I need to split the comma separated values of clob datatype column in one table and insert only the distinct rows in another table.

The details of the table are given below. The toaddress column in Table A is of datatype CLOB. Table B has unique constraint on columns (name,domain).

for ex:

Table A

id toaddress domain
1 xxx@gmail.com,yyy@gmail.com,zzz@gmail.com root
2 aaa@gmail.com,bbb@gmail.com,xxx@gmail.com abc
3 ccc@gmail.com,xxx@gmail.com root
4 yyy@gmail.com,zzz@gmail.com root
5 ddd@gmail.com abc

table B

id name emailaddress domain
1 xxx xxx@gmail.com root
2 yyy yyy@gmail.com root
3 zzz zzz@gmail.com root
4 aaa aaa@gmail.com abc
5 bbb bbb@gmail.com abc
6 xxx xxx@gmail.com abc
7 ccc ccc@gmail.com root
8 ddd ddd@gmail.com abc

how to achieve this using a procedure.

Thanks in advance.

and we said...

First up, you need to split the CSVs into rows. Here's one way you can do this:

create table t (id int, toaddress clob, domain varchar(4));
    
insert into t (id, toaddress, domain) values (1, 'xxx@gmail.com,yyy@gmail.com,zzz@gmail.com', 'root');
insert into t (id, toaddress, domain) values (2, 'aaa@gmail.com,bbb@gmail.com,xxx@gmail.com', 'abc');
insert into t (id, toaddress, domain) values (3, 'ccc@gmail.com,xxx@gmail.com', 'root');
insert into t (id, toaddress, domain) values (4, 'yyy@gmail.com,zzz@gmailcom', 'root');
insert into t (id, toaddress, domain) values (5, 'ddd@gmail.com', 'abc');

select 
  trim(
    regexp_substr(t.toaddress, '[^,]+', 1, levels.column_value)
  ) as email,
  domain
from  t,
      table(cast(multiset(
        select level from dual 
        connect by level <= length (regexp_replace(t.toaddress, '[^,]+')) + 1
      ) as sys.OdciNumberList)) levels;

EMAIL                DOMA
-------------------- ----
xxx@gmail.com        root
yyy@gmail.com        root
zzz@gmail.com        root
aaa@gmail.com        abc 
bbb@gmail.com        abc 
xxx@gmail.com        abc 
ccc@gmail.com        root
xxx@gmail.com        root
yyy@gmail.com        root
zzz@gmailcom         root
ddd@gmail.com        abc


You can't just slap a distinct on this though. You'll get an "ORA-00932: inconsistent datatypes: expected - got CLOB" error.

To overcome this, cast the email to varchar2:

with emails as (
  select 
    cast(trim(
      regexp_substr(t.toaddress, '[^,]+', 1, levels.column_value)
    ) as varchar2(320)) as email,
    domain,
    id
  from  t,
        table(cast(multiset(
          select level from dual 
          connect by level <= length (regexp_replace(t.toaddress, '[^,]+')) + 1
        ) as sys.OdciNumberList)) levels
)
  select distinct substr(email, 1, instr(email, '@')-1), email, domain 
  from   emails;

NAME       EMAIL                DOMA
---------- -------------------- ----
bbb        bbb@gmail.com        abc 
xxx        xxx@gmail.com        abc 
zzz        zzz@gmailcom         root
aaa        aaa@gmail.com        abc 
yyy        yyy@gmail.com        root
ddd        ddd@gmail.com        abc 
zzz        zzz@gmail.com        root
ccc        ccc@gmail.com        root
xxx        xxx@gmail.com        root


You still need to remove the duplicate (name, domain) pairs though. You can't use distinct for this.

Instead, for each (name, domain) pair you want a counter. Row_number() does this for you. You can then return those rows where the row number = 1.

Add another call to row_number() to your final select to assign the ids and you're done:

with emails as (
  select 
    cast(trim(
      regexp_substr(t.toaddress, '[^,]+', 1, levels.column_value)
    ) as varchar2(320)) as email,
    domain,
    id
  from  t,
        table(cast(multiset(
          select level from dual 
          connect by level <= length (regexp_replace(t.toaddress, '[^,]+')) + 1
        ) as sys.OdciNumberList)) levels
), names as (
  select substr(email, 1, instr(email, '@')-1) name, email, domain, id,
         row_number() over (
           partition by substr(email, 1, instr(email, '@')-1), domain
           order by id
         ) rn
  from   emails
)
  select row_number() over (order by id, name) id,
         name, email, domain
  from   names
  where  rn = 1;

        ID NAME       EMAIL                DOMA
---------- ---------- -------------------- ----
         1 xxx        xxx@gmail.com        root
         2 yyy        yyy@gmail.com        root
         3 zzz        zzz@gmail.com        root
         4 aaa        aaa@gmail.com        abc 
         5 bbb        bbb@gmail.com        abc 
         6 xxx        xxx@gmail.com        abc 
         7 ccc        ccc@gmail.com        root
         8 ddd        ddd@gmail.com        abc 


But... a unique constraint over (name, domain) seems strange?

The domains appear to be separate from the email domain. The emails below both start chris.saxon, but are separate addresses:

chris.saxon@here.com
chris.saxon@there.come

So these could be for different people. If these are both under the "root" domain you'll lose one!

How are you sure this won't happen?

and you rated our response

  (1 rating)

Is this answer out of date? If it is, please let us know via a Review

Reviews

Review - To Split comma separated values of clob datatype column into distinct rows in another table

June 24, 2016 - 12:47 pm UTC

Reviewer: swarna v

Thank you for your quick response.

The provided solution covered all the issues that we might encounter. Like you mentioned, the email id having different domain name but same user name will be discarded when we select distinct rows depending on name and domain (unique constraint). In our case, we allow to configure the email id with the same domain name and so the above problem may not arise. The email id domain name and the domain name column in table B are two different data.
Chris Saxon

Followup  

June 24, 2016 - 2:52 pm UTC

Thanks!

More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.