You Asked

*** TABLE DDL

create table internal_table ( id number, timeseries_name varchar2(50), timeseries_description varchar2(250) );

insert into internal_table values( 1, 'macro1', 'macro_timeseries1');
insert into internal_table values( 2, 'macro2', 'macro_timeseries2');

create directory "EXTERNAL_DATA" AS '/home/**user**';
-- with **user** as suitable user home folder

CREATE TABLE "EXTERNAL_TABLE"
( "TIMESERIES_NAME" VARCHAR2(50 BYTE),
"TIMESERIES_DATE" DATE,
"TIMESERIES_VALUE" NUMBER
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
DEFAULT DIRECTORY "EXTERNAL_DATA"
ACCESS PARAMETERS
( RECORDS DELIMITED BY NEWLINE
CHARACTERSET AL32UTF8
STRING SIZES ARE IN BYTES
NOBADFILE
NODISCARDFILE
NOLOGFILE
FIELDS
TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"' AND '"'
NOTRIM
(
"TIMESERIES_NAME" CHAR,
"TIMESERIES_DATE" DATE "YYYYMMDD",
"TIMESERIES_VALUE"
)
)
LOCATION
( 'TIMESERIES_FILE.txt'
)
)
REJECT LIMIT UNLIMITED PARALLEL 8;

**** END TABLE DDL

Join external_table with internal_table in order to join the timeseries_value in the external_table with appropriate id in internal_table:

CREATE TABLE stage_timeseries nologging AS
SELECT internal_table.id, external_table.timeseries_date, external_table.timeseries_value
from external_table, internal_table
where external_table.timeseries_name = internal_table.timeseries_name;

Issue: The TIMESERIES_FILE.txt text file for the external_table has duplicate rows (defined by >1 combinations of timeseries_name and timeseries_date). The combination of timeseries_name and timeseries_date should be unique in the database, but it is not unique in the text file for the external_table. We have some duplicate rows.

I can delete the duplicate rows in the stage_timeseries table I created:

DELETE
FROM stage_timeseries a
WHERE ROWID !=
(SELECT MAX(ROWID)
FROM stage_timeseries b
WHERE b.timeseries_id = a.timeseries_id
AND a.timeseries_date = b.timeseries_date
);

BUT, that TIMESERIES_FILE.txt text file is a 15GB file containing approximately 180 million rows, so the delete command has to run likewise over the 180 million rows.

How might I combine the "SELECT MAX(ROWID)..." portion of the delete command into the select query of the above CTAS statement such that the creation of the stage_timeseries table contains unique rows for each (timeseries_id,timeseries_date)?

CREATE TABLE stage_timeseries nologging AS
SELECT internal_table.id, external_table.timeseries_date, external_table.timeseries_value
from external_table, internal_table
where external_table.timeseries_name = internal_table.timeseries_name
and ***unique combination of external_table.timeseries_name, external_table.timeseries_date***

Thank you, Tom

and Tom said...

apparently, you don't care which record you keep - just keep one record per id/date

So, as you select it, number the rows by id/date and only keep the first.

select id, timeseries_date, timeseries_value
  from (
SELECT internal_table.id,  
       external_table.timeseries_date, 
       external_table.timeseries_value,<b>
       row_number() 
         over (partition by id, timeseries_date order by timeseries_value) rn</b>
  from external_table, internal_table
 where external_table.timeseries_name = internal_table.timeseries_name
       )<b>
 where rn = 1;</b>

Is this answer out of date? If it is, please let us know via a Comment

More to Explore

Analytics

Analytic SQL got you confused? Check out Connor McDonald's complete video course.

Breadcrumb

Question and Answer

You Asked

and Tom said...

More to Explore

Analytics

Removing Duplicate Rows While Loading from External Table

Breadcrumb

Question and Answer

You Asked

and Tom said...

More to Explore

Analytics