First up, you need to split the CSVs into rows. Here's one way you can do this:
create table t (id int, toaddress clob, domain varchar(4));
insert into t (id, toaddress, domain) values (1, 'xxx@gmail.com,yyy@gmail.com,zzz@gmail.com', 'root');
insert into t (id, toaddress, domain) values (2, 'aaa@gmail.com,bbb@gmail.com,xxx@gmail.com', 'abc');
insert into t (id, toaddress, domain) values (3, 'ccc@gmail.com,xxx@gmail.com', 'root');
insert into t (id, toaddress, domain) values (4, 'yyy@gmail.com,zzz@gmailcom', 'root');
insert into t (id, toaddress, domain) values (5, 'ddd@gmail.com', 'abc');
select
trim(
regexp_substr(t.toaddress, '[^,]+', 1, levels.column_value)
) as email,
domain
from t,
table(cast(multiset(
select level from dual
connect by level <= length (regexp_replace(t.toaddress, '[^,]+')) + 1
) as sys.OdciNumberList)) levels;
EMAIL DOMA
-------------------- ----
xxx@gmail.com root
yyy@gmail.com root
zzz@gmail.com root
aaa@gmail.com abc
bbb@gmail.com abc
xxx@gmail.com abc
ccc@gmail.com root
xxx@gmail.com root
yyy@gmail.com root
zzz@gmailcom root
ddd@gmail.com abc
You can't just slap a distinct on this though. You'll get an "ORA-00932: inconsistent datatypes: expected - got CLOB" error.
To overcome this, cast the email to varchar2:
with emails as (
select
cast(trim(
regexp_substr(t.toaddress, '[^,]+', 1, levels.column_value)
) as varchar2(320)) as email,
domain,
id
from t,
table(cast(multiset(
select level from dual
connect by level <= length (regexp_replace(t.toaddress, '[^,]+')) + 1
) as sys.OdciNumberList)) levels
)
select distinct substr(email, 1, instr(email, '@')-1), email, domain
from emails;
NAME EMAIL DOMA
---------- -------------------- ----
bbb bbb@gmail.com abc
xxx xxx@gmail.com abc
zzz zzz@gmailcom root
aaa aaa@gmail.com abc
yyy yyy@gmail.com root
ddd ddd@gmail.com abc
zzz zzz@gmail.com root
ccc ccc@gmail.com root
xxx xxx@gmail.com root
You still need to remove the duplicate (name, domain) pairs though. You can't use distinct for this.
Instead, for each (name, domain) pair you want a counter. Row_number() does this for you. You can then return those rows where the row number = 1.
Add another call to row_number() to your final select to assign the ids and you're done:
with emails as (
select
cast(trim(
regexp_substr(t.toaddress, '[^,]+', 1, levels.column_value)
) as varchar2(320)) as email,
domain,
id
from t,
table(cast(multiset(
select level from dual
connect by level <= length (regexp_replace(t.toaddress, '[^,]+')) + 1
) as sys.OdciNumberList)) levels
), names as (
select substr(email, 1, instr(email, '@')-1) name, email, domain, id,
row_number() over (
partition by substr(email, 1, instr(email, '@')-1), domain
order by id
) rn
from emails
)
select row_number() over (order by id, name) id,
name, email, domain
from names
where rn = 1;
ID NAME EMAIL DOMA
---------- ---------- -------------------- ----
1 xxx xxx@gmail.com root
2 yyy yyy@gmail.com root
3 zzz zzz@gmail.com root
4 aaa aaa@gmail.com abc
5 bbb bbb@gmail.com abc
6 xxx xxx@gmail.com abc
7 ccc ccc@gmail.com root
8 ddd ddd@gmail.com abc
But... a unique constraint over (name, domain) seems strange?
The domains appear to be separate from the email domain. The emails below both start chris.saxon, but are separate addresses:
chris.saxon@here.com
chris.saxon@there.come
So these could be for different people. If these are both under the "root" domain you'll lose one!
How are you sure this won't happen?