You Asked

Hi Tom,

I have a SQL script which (runs daily) inserts data into a table and then de-duping is done based on the rowid. But recently I have observed that rowid is not incrementing with inserts and instead Oracle is re-using space to generate rowids. The row which got inserted later has a rowid lesser than the row which got inserted before the latest row.

Below is m sample data:
ROWID req_num activity create_dt update_dt
AAU2OBADnAAGTR2AAB 102031275 abc 4/12/17 2:24 PM 4/12/17 2:38 PM
AAU2OBABpAAKMBLAAC 102031275 def 4/12/17 3:32 PM 4/12/17 3:32 PM

You can clearly see that the second row has a lesser rowid than the first one.

Is there an alternative solution for de-duping with lesser processing time as the table has around 2Mil records with 42 columns?

Thanks!

and Connor said...

As you've seen, you cannot rely on the rowid for any kind of "order" when it comes to the data. You need to have an attribute on the row itself (eg a sequence, or timestamp) which will let you define what is the "latest"

The easiest way to de-dup is to not allow them :-) Why not have a unique constraint on req_num ?

If that is not feasible for some reason, then you can de-dup on (say) update_dt using:

delete from T
where rowid in ( 
  select rid
  from ( 
    select rowid rid, 
           row_number() over ( partition by req_num order by update_dt desc ) as rk
    from   t 
  ) where rk = 1
)

Rating

(4 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Did you mean that delete?

David Weigel, April 21, 2017 - 4:48 pm UTC

Won't that delete eliminate the most recent row for every req_num, whether it's a duplicate or not? Do you mean not in? (Assuming they want to keep the newest and not the oldest.)

April 22, 2017 - 1:38 am UTC

Whoops... fail on my part. Thanks for catching that.

delete from T
where rowid in ( 
  select rid
  from ( 
    select rowid rid, 
           row_number() over ( partition by req_num order by update_dt desc ) as rk
    from   t 
  ) where rk > 1   -- changed
)

Mahi V, April 25, 2017 - 3:33 pm UTC

Mahi V, April 25, 2017 - 3:40 pm UTC

Thank you for the useful response. The time to execute such query to process around 200K duplicate rows (where I want to retain the last inserted row and delete old ones) is taking lot of time. Is there an alternate way to do this?

delete from T ts
where rowid not in ( 
  select rid
  from (select rowid rid, 
           row_number() over ( partition by req_num order by update_dt desc) as rk
    from T
     where trunc(activity_dt) >= 'date'
  ) where rk = 1
  and trunc(activity_dt) >= 'date'
)

April 26, 2017 - 4:34 am UTC

Did you try this one I suggested

delete from T
where rowid in ( 
  select rid
  from ( 
    select rowid rid, 
           row_number() over ( partition by req_num order by update_dt desc ) as rk
    from   t 
  ) where rk > 1   -- changed
)

Works but time to execute is high

Mahi V, April 25, 2017 - 3:51 pm UTC

Thank you for the response. It helped me get an idea of de-dupe and now I want to delete the rows where I retain last inserted/updated row and delete the old ones. For this I have the below delete statement but the processing time is too high to execute it. The de-duping has to be done on 200K records.

delete from T ts
where rowid not in ( 
  select rid
  from (select rowid rid, 
           row_number() over ( partition by req_num order by update_dt desc) as rk
    from T
     where trunc(activity_dt) >= 'date'
  ) where rk = 1
  and trunc(activity_dt) >= 'date'
)

More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.

Breadcrumb

Question and Answer

You Asked

and Connor said...

Rating

Comments

Did you mean that delete?

Works but time to execute is high

More to Explore

Analytics

ROWID not incrementing. Need an alternative for deduping

Breadcrumb

Question and Answer

You Asked

and Connor said...

Rating

Comments

Did you mean that delete?

Works but time to execute is high

More to Explore

Analytics