Skip to Main Content
  • Questions
  • Deduplication of rows with null values

Breadcrumb

Question and Answer

Chris Saxon

Thanks for the question, Praveen.

Asked: February 10, 2017 - 9:18 am UTC

Last updated: February 13, 2017 - 5:37 pm UTC

Version: 11.2.0.2.0

Viewed 1000+ times

You Asked

drop table test purge;

create table test (ssn int, address varchar2(10), tax_id varchar2(10), identify int);

insert into test values(100, '','', null);
insert into test values(111, '','', null);
insert into test values(111, 'xyz','', null);
insert into test values(null, 'xyz','', null);
insert into test values(111, '','222', null);
insert into test values(null, '','222', null);
insert into test values(122, '','', null);

commit;

--Question--
Could you please help me write a better query to assign a unique id for IDENTIFY field,
  so as to associate the same ID having same SSN or ADDRESS or TAX_ID assuming the person to be same.
where; "-" can be any value or null.

SSN ADDRESS TAX_ID IDENTIFY 
100    -      -      -       --> 1 -- a new id
111    -      -      -       --> 2 -- a new id
111   xyz     -      -       --> 2 -- ssn same
 -    xyz     -      -       --> 2 -- address same
111    -     222     -       --> 2 -- ssn same
 -     -     222     -       --> 2 -- tax_id same
122    -      -      -       --> 3 -- a new id

7 rows selected.

and Chris said...

So you want to assign an id, based on the SSN? But some rows have no SSN defined?

If so, you can find an SSN for the rows without one by joining to the table again. Look for rows where the address or tax_id matches the current row:

select t.*, 
       case 
         when ssn is null then 
           (select min(ssn) from test t2
            where  (t2.address = t.address) or
                   (t2.tax_id = t.tax_id)
           )
         else ssn
       end ssn_v2
from   test t;

SSN  ADDRESS  TAX_ID  IDENTIFY  SSN_V2  
100                             100     
111                             111     
111  xyz                        111     
     xyz                        111     
111           222               111     
              222               111     
122                             122


You can then use dense_rank, ordering by this expression, to assign the IDs:

select t.*, 
       dense_rank() over (order by case 
         when ssn is null then 
           (select min(ssn) from test t2
            where  (t2.address = t.address) or
                   (t2.tax_id = t.tax_id)
           )
         else ssn
       end) id
from   test t;

SSN  ADDRESS  TAX_ID  IDENTIFY  ID  
100                             1   
111  xyz                        2   
111                             2   
     xyz                        2   
111           222               2   
              222               2   
122                             3 


Note: for this to work, you need to be sure that the columns ADDRESS and TAX_ID uniquely identify someone. TAX_ID probably does, but ADDRESS almost certainly doesn't!

Most houses have multiple living in them. So if you have just an address with no SSN or TAX_ID you could incorrectly merge people...

For example, if you have another person at address xyz:

insert into test values(133, 'xyz','', null);


Which SSN does the row with only xyz set belong to? You don't know!

Is this answer out of date? If it is, please let us know via a Comment

More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.