Skip to Main Content
  • Questions
  • Removing Duplicate Rows While Loading from External Table

Breadcrumb

Question and Answer

Tom Kyte

Thanks for the question, Michael.

Asked: January 18, 2009 - 1:37 am UTC

Last updated: January 19, 2009 - 11:24 am UTC

Version: 11.1.0.7

Viewed 1000+ times

You Asked

*** TABLE DDL

create table internal_table ( id number, timeseries_name varchar2(50), timeseries_description varchar2(250) );

insert into internal_table values( 1, 'macro1', 'macro_timeseries1');
insert into internal_table values( 2, 'macro2', 'macro_timeseries2');

create directory "EXTERNAL_DATA" AS '/home/**user**';
-- with **user** as suitable user home folder

CREATE TABLE "EXTERNAL_TABLE"
( "TIMESERIES_NAME" VARCHAR2(50 BYTE),
"TIMESERIES_DATE" DATE,
"TIMESERIES_VALUE" NUMBER
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
DEFAULT DIRECTORY "EXTERNAL_DATA"
ACCESS PARAMETERS
( RECORDS DELIMITED BY NEWLINE
CHARACTERSET AL32UTF8
STRING SIZES ARE IN BYTES
NOBADFILE
NODISCARDFILE
NOLOGFILE
FIELDS
TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"' AND '"'
NOTRIM
(
"TIMESERIES_NAME" CHAR,
"TIMESERIES_DATE" DATE "YYYYMMDD",
"TIMESERIES_VALUE"
)
)
LOCATION
( 'TIMESERIES_FILE.txt'
)
)
REJECT LIMIT UNLIMITED PARALLEL 8;

**** END TABLE DDL

Join external_table with internal_table in order to join the timeseries_value in the external_table with appropriate id in internal_table:

CREATE TABLE stage_timeseries nologging AS
SELECT internal_table.id, external_table.timeseries_date, external_table.timeseries_value
from external_table, internal_table
where external_table.timeseries_name = internal_table.timeseries_name;

Issue: The TIMESERIES_FILE.txt text file for the external_table has duplicate rows (defined by >1 combinations of timeseries_name and timeseries_date). The combination of timeseries_name and timeseries_date should be unique in the database, but it is not unique in the text file for the external_table. We have some duplicate rows.

I can delete the duplicate rows in the stage_timeseries table I created:

DELETE
FROM stage_timeseries a
WHERE ROWID !=
(SELECT MAX(ROWID)
FROM stage_timeseries b
WHERE b.timeseries_id = a.timeseries_id
AND a.timeseries_date = b.timeseries_date
);

BUT, that TIMESERIES_FILE.txt text file is a 15GB file containing approximately 180 million rows, so the delete command has to run likewise over the 180 million rows.

How might I combine the "SELECT MAX(ROWID)..." portion of the delete command into the select query of the above CTAS statement such that the creation of the stage_timeseries table contains unique rows for each (timeseries_id,timeseries_date)?


CREATE TABLE stage_timeseries nologging AS
SELECT internal_table.id, external_table.timeseries_date, external_table.timeseries_value
from external_table, internal_table
where external_table.timeseries_name = internal_table.timeseries_name
and ***unique combination of external_table.timeseries_name, external_table.timeseries_date***


Thank you, Tom

and Tom said...

apparently, you don't care which record you keep - just keep one record per id/date

So, as you select it, number the rows by id/date and only keep the first.

select id, timeseries_date, timeseries_value
  from (
SELECT internal_table.id,  
       external_table.timeseries_date, 
       external_table.timeseries_value,<b>
       row_number() 
         over (partition by id, timeseries_date order by timeseries_value) rn</b>
  from external_table, internal_table
 where external_table.timeseries_name = internal_table.timeseries_name
       )<b>
 where rn = 1;</b>



Is this answer out of date? If it is, please let us know via a Comment

More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.