Effective way of migrating billions of records
Connor McDonald

Thanks for the question, Branko.

Asked: December 08, 2015 - 10:35 pm UTC

Last updated: October 16, 2023 - 1:41 am UTC

Version: 11.2.0.4.0


You Asked

Hi Tom,

I have a task to migrate data from 4 hierarchical tables into a few hundred flattened tables. I have developed some options, but I am looking for possible improvements or more effective approaches.

So I have 4 tables let's say:
create table t1(id number, name varchar2(20));
create table t2(id number, t1_id number, xmlmetadata clob);
create table t3(id number, t2_id number);
create table t4(id number, t3_id number, name varchar2(40), value varchar2(4000));

insert into t1 values(10,'WELLINGTON');
insert into t1 values(11,'AUCKLAND');
insert into t2 values(100,10,'<Attribute>firstname</Attribute><Attribute>lastname</Attribute>');
insert into t2 values(101,10,'<Attribute>firstname</Attribute><Attribute>gender</Attribute>');
insert into t2 values(102,11,'<Attribute>firstname</Attribute><Attribute>salary</Attribute>');
insert into t3 values(1001,100);
insert into t3 values(1002,100);
insert into t3 values(1003,101);
insert into t3 values(1004,102);
insert into t4 values(5001,1001,'firstname','John');
insert into t4 values(5002,1001,'lastname','Smith');
insert into t4 values(5003,1002,'firstname','Mark');
insert into t4 values(5004,1002,'gender','M');
insert into t4 values(5005,1004,'firstname','Alex');
insert into t4 values(5006,1004,'salary','50000');

Table T1 is the parent of T2, T2 is the parent of T3 and T3 is the parent of T4. The tables have indexes and constraints. For example, column t1_id in table t2 has a foreign key to column id in table t1, and so on for the other tables.
Number of records expected in production in each table:
T1 = 400 records, T2 = 500 records, T3 = 250 Million records, T4 = 2 Billion records.

Please note that I have tested this in tables of size
T1 = 300, T2 = 350, T3 = 7M, T4 = 45M records.

Each record in T1 represents one flattened table. The columns of each table are defined in XMLMETADATA from T2.
Records from T3 and T4 are then joined together and inserted into the correct table according to the parent IDs.
The data in table T4 is in key-value format and will be transformed into a horizontal/flattened table, i.e. multiple records from T4 become one record in the new flattened table.

So the above example would result in 2 flattened tables:
Table WELLINGTON with columns "firstname", "lastname" and "gender" and table AUCKLAND with columns "firstname" and "salary".

Flattened data of table WELLINGTON would look like:
FIRSTNAME | LASTNAME | GENDER
John | Smith | null
Mark | null | M

Table AUCKLAND:
FIRSTNAME | SALARY
Alex | 50000

Long story short: I have to join all the tables, pivot the records from T4 and insert them into a new flattened table (FT). The name of the table is defined in T1, the columns are defined in T2 and the data comes from T3 and T4.
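For the sample data above, a minimal hand-written (not dynamic) version of the WELLINGTON load might look like the sketch below; it assumes the flattened table WELLINGTON with columns firstname, lastname and gender has already been created:

insert into wellington (firstname, lastname, gender)
select firstname, lastname, gender
from (
  select t3.id evidence_id, t4.name, t4.value
  from   t1, t2, t3, t4
  where  t1.id   = t2.t1_id
  and    t2.id   = t3.t2_id
  and    t3.id   = t4.t3_id
  and    t1.name = 'WELLINGTON'
)
pivot ( max(value) for name in ('firstname' as firstname,
                                'lastname'  as lastname,
                                'gender'    as gender) );

The real statement would be generated per table, with the IN list built from the XMLMETADATA in T2.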

I have developed the following solutions:
1) Option 1
For each record in T2 I created a dynamic SQL statement like:
INSERT into X SELECT Y from (select A from t1,t2,t3,t4 pivot ABC)
Please note that rules for pivoting are dynamic and are defined in table T2.
After each record in T2 I do a commit.

This is one example of my dynamic SQL query, just to give an idea of what I am doing in SQL:
INSERT /*+ APPEND */ INTO MSDDET10_JN
  (evidenceid, evidencetype, evidencetypeversiondefid, versionno,
   STARTDATE_, ENDDATE_, CHILDORADDITIONALOCCUPANT_, GENDER_, DATEOFBIRTH_,
   TOTALDAYSINCARE_, OUTOFWEEKS_, CSOIND_, TYPEOFPAYMENTORTYPEOFCHILD_,
   jn_seq, jn_flag, jn_datetime)
(SELECT /*+ PARALLEL */ t2.*, iap_seq.NEXTVAL jn_seq, 'A' jn_flag,
        to_date('30/10/2015 16:03:09', 'dd/mm/yyyy hh24:mi:ss') jn_datetime
 FROM (SELECT t1.*
       FROM (SELECT deda.evidenceid, etd.evidencetypecode evidencetype,
                    ded.evidencetypeversiondefid, ded.versionno, deda.name, deda.VALUE
             FROM dynamicevidencedata ded, dynamicevidencedataattribute deda,
                  evidencetypedef etd, evidencetypeversiondef etvd
             WHERE ded.evidenceid = deda.evidenceid
             AND   ded.evidencetypeversiondefid = etvd.evidencetypeversiondefid
             AND   etvd.evidencetypedefid = etd.evidencetypedefid
             AND   ded.evidencetypeversiondefid = 47570)
       PIVOT (MAX(VALUE) FOR name IN ('startDate' STARTDATE_, 'endDate' ENDDATE_,
                                      'childOrAdditionalOccupant' CHILDORADDITIONALOCCUPANT_,
                                      'gender' GENDER_, 'dateOfBirth' DATEOFBIRTH_,
                                      'totalDaysInCare' TOTALDAYSINCARE_, 'outOfWeeks' OUTOFWEEKS_,
                                      'csoInd' CSOIND_,
                                      'typeOfPaymentOrTypeOfChild' TYPEOFPAYMENTORTYPEOFCHILD_)) t1) t2)

This approach took about 55 minutes to transform/migrate my test data (T4 = 45M records).
My concern with this approach is that in production, where there will be 2B+ records, I have to join all 4 tables 500 times, and I am afraid this is going to take ages even with indexes.
Even though the tables have indexes, the optimizer will sometimes use an index when the result set is small and do a full table scan when the result set is big. Please correct me if I am wrong.

2) Option 2:
I tried to split these 4 tables into 350 temp tables, so that each temp table would hold all the data that needs to be inserted into one flattened table. In the second step I read each temp table, do the pivot and insert into the flattened tables.
Steps:
- use INSERT FIRST: when t1.name = 'AUCKLAND' then into tmp_auckland ... when t1.name = 'WELLINGTON' then into tmp_wellington ... and so on, select A,B,C from t1,t2,t3,t4 where <join condition> (a sketch of this multi-table insert follows the temp-table definitions below).
- then I read each record in T2 to get the table structure (so I can build the dynamic SQL with the pivot) and insert the data from the temp tables into the flattened tables. I create a dynamic statement similar to option 1: insert into x select A from (pivot temp tables). I don't have to do any joins here because I already prepared all the data I need in step 1.

All temp tables will have the same structure:
create table tmp_wellington (t1_name varchar2(30), t2_id number, t4_name varchar2(50), t4_value varchar2(4000))
create table tmp_auckland (t1_name varchar2(30), t2_id number, t4_name varchar2(50), t4_value varchar2(4000))
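A sketch of the step-1 multi-table insert for just the two sample cities; the remaining WHEN branches would be generated the same way from the records in T1:

insert /*+ APPEND */ first
  when t1_name = 'WELLINGTON' then
    into tmp_wellington (t1_name, t2_id, t4_name, t4_value)
    values (t1_name, t2_id, t4_name, t4_value)
  when t1_name = 'AUCKLAND' then
    into tmp_auckland (t1_name, t2_id, t4_name, t4_value)
    values (t1_name, t2_id, t4_name, t4_value)
  -- ... one WHEN branch per city, generated from T1
select t1.name t1_name, t2.id t2_id, t4.name t4_name, t4.value t4_value
from   t1, t2, t3, t4
where  t1.id = t2.t1_id
and    t2.id = t3.t2_id
and    t3.id = t4.t3_id;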

My intention here was to join and full-scan the 4 main tables just once and split the data into smaller sets. Then, in the second step, reading from a temp table means scanning (more or less) only the records I need for that flattened table (except where there are 2 records in T2 for one T1 record), and I can use a parallel hint.

The problem I encountered here is that the statement INSERT FIRST WHEN x THEN INTO a WHEN y THEN INTO b ... has a limit, and I can insert into only about 120 tables (of 6 columns each) with one statement. Because of that I still need to do a full scan of t1, t2, t3, t4 three times (120*3 = 360, for 350 records in T2).
Each full scan of the joined tables took about 8 minutes, so 24 minutes in total to insert into the temp tables, and then an additional 16 minutes to insert from the temp tables into the flattened tables.

So comparing the two: option 1 took 55 minutes and option 2 took 24 + 16 = 40 minutes. Option 2 is faster, but I wonder if I could improve performance further.

3) Option 3: does not work completely, or at least is not as dynamic as I would like it to be.
My 3rd option was to use option 2 but with a different approach for inserting into the temp tables. I was trying to avoid doing the full scan 3 times and instead do it once, process the records in PL/SQL, and insert them into the temp tables using BULK COLLECT and FORALL once a particular table reaches the write limit (variable l_write_limit in the code below).
Please note that here I wanted to use FORALL for the insert, but I could not get it to work with a dynamic table name. If I use the commented-out FORALL (see the code below) then I get the error "PLS-00457: expressions have to be of SQL types".
Just to test performance I manually created all the FORALL statements with IF conditions.

The performance of this one was slightly better: where the 3 full scans in option 2 took 24 minutes, the approach in option 3 took 16 minutes, so option 3 would take approximately 16 (loading into temp tables) + 16 (temp tables to flattened tables) = 32 minutes in total, which is better than both option 1 and option 2.

Example for option 3, loading data into temp tables only:
All temp tables will have the same structure:
create table tmp_wellington (t1_name varchar2(30), t2_id number, t4_name varchar2(50), t4_value varchar2(4000))
create table tmp_auckland (t1_name varchar2(30), t2_id number, t4_name varchar2(50), t4_value varchar2(4000))

DECLARE
  CURSOR c_evid IS
    SELECT t1.name t1_name, t2.id t2_id, t4.name t4_name, t4.VALUE t4_value
    FROM   t1, t2, t3, t4
    WHERE  t1.id = t2.t1_id
    AND    t2.id = t3.t2_id
    AND    t3.id = t4.t3_id;

  CURSOR c_tbl_struct IS
    SELECT t1.name t1_name, t2.id t2_id
    FROM   t1, t2
    WHERE  t1.id = t2.t1_id;

  TYPE t_evid IS TABLE OF c_evid%ROWTYPE
    INDEX BY PLS_INTEGER;
  TYPE tr_data IS RECORD (rec_cnt NUMBER
                         ,t1_name VARCHAR2(30)
                         ,tbl_row t_evid);
  TYPE taa_data IS TABLE OF tr_data
    INDEX BY PLS_INTEGER;
  --
  l_from        t_evid;
  l_to          taa_data;
  k             NUMBER;
  l_etvd_idx    NUMBER;
  l_rec_cnt     NUMBER;
  l_fetch_limit NUMBER := 1000;
  l_write_limit NUMBER := 500;
  l_last_row    NUMBER;
  l_tmp_cnt     NUMBER := 0;
BEGIN
  dbms_output.put_line('------------------------------------------');
  dbms_output.put_line('Copying started at ' || TO_CHAR(SYSDATE, 'hh24:mi:ss'));
  -- initialize collection
  FOR r_tbl_struct IN c_tbl_struct
  LOOP
    l_to(r_tbl_struct.t2_id).rec_cnt := 0;
    l_to(r_tbl_struct.t2_id).t1_name := r_tbl_struct.t1_name;
  END LOOP;
  -- start reading data
  OPEN c_evid;
  LOOP
    FETCH c_evid
    BULK COLLECT INTO l_from
    LIMIT l_fetch_limit;
    dbms_output.put_line('Fetched ' || l_from.COUNT);
    EXIT WHEN l_from.COUNT = 0;
    -- read data from cursor into collections
    FOR i IN l_from.FIRST .. l_from.LAST
    LOOP
      l_etvd_idx := l_from(i).t2_id;
      l_rec_cnt  := l_to(l_etvd_idx).rec_cnt + 1;
      l_to(l_etvd_idx).tbl_row(l_rec_cnt) := l_from(i);
      l_to(l_etvd_idx).rec_cnt := l_rec_cnt;
      IF l_rec_cnt >= l_write_limit THEN
        dbms_output.put_line(
          'Ready to write. l_etvd = ' || l_etvd_idx || ' cnt = ' || l_rec_cnt || ', etc = ' || l_to(l_etvd_idx).t1_name);
        IF l_to(l_etvd_idx).t1_name = 'WELLINGTON' THEN
          FORALL j IN INDICES OF l_to(l_etvd_idx).tbl_row
            INSERT INTO tmp_wellington -- this should be dynamic
            VALUES l_to(l_etvd_idx).tbl_row(j);
        ELSIF l_to(l_etvd_idx).t1_name = 'AUCKLAND' THEN
          FORALL j IN INDICES OF l_to(l_etvd_idx).tbl_row
            INSERT INTO tmp_auckland -- this should be dynamic
            VALUES l_to(l_etvd_idx).tbl_row(j);
        END IF;
        --FORALL j IN INDICES OF l_to(l_etvd_idx).tbl_row
        --  EXECUTE IMMEDIATE 'INSERT INTO '||l_to(l_etvd_idx).t1_name||' VALUES :1'
        --  using l_to(l_etvd_idx).tbl_row(1);
        l_tmp_cnt := l_tmp_cnt + l_rec_cnt;
        l_to(l_etvd_idx).rec_cnt := 0;
        l_to(l_etvd_idx).tbl_row.delete;
      END IF;
    END LOOP;
  END LOOP;
  CLOSE c_evid;
  k := l_to.FIRST;
  WHILE k IS NOT NULL
  LOOP
    l_tmp_cnt := l_tmp_cnt + l_to(k).rec_cnt;
    IF l_to(k).rec_cnt > 0 THEN
      dbms_output.put_line(
        'Ready to write - last. etvd_id = ' || k || ', etc = ' || l_to(k).t1_name || ', rec_cnt = ' || l_to(k).rec_cnt);
      -- here has to be forall for all the plsql structures that still have records in it
      IF l_to(k).t1_name = 'WELLINGTON' THEN
        FORALL j IN INDICES OF l_to(k).tbl_row
          INSERT INTO tmp_wellington -- this should be dynamic
          VALUES l_to(k).tbl_row(j);
      ELSIF l_to(k).t1_name = 'AUCKLAND' THEN
        FORALL j IN INDICES OF l_to(k).tbl_row
          INSERT INTO tmp_auckland -- this should be dynamic
          VALUES l_to(k).tbl_row(j);
      END IF;
    END IF;
    k := l_to.NEXT(k);
  END LOOP;
  dbms_output.put_line('All records in plsql structures: ' || l_tmp_cnt);
END;
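One possible way around the PLS-00457 problem, as a sketch only: if the row shape is declared as SQL-level object/collection types (the type names t_tmp_row and t_tmp_tab below are invented) instead of a PL/SQL record collection, the whole array can be bound into a dynamic INSERT ... SELECT FROM TABLE(...):

create type t_tmp_row as object
( t1_name  varchar2(30)
, t2_id    number
, t4_name  varchar2(50)
, t4_value varchar2(4000)
);
/
create type t_tmp_tab as table of t_tmp_row;
/

declare
  l_rows t_tmp_tab := t_tmp_tab();
begin
  -- in the real block the rows would be buffered from the cursor;
  -- one hard-coded row here just to show the dynamic flush
  l_rows.extend;
  l_rows(1) := t_tmp_row('WELLINGTON', 100, 'firstname', 'John');

  execute immediate
    'insert into tmp_wellington (t1_name, t2_id, t4_name, t4_value)
     select t1_name, t2_id, t4_name, t4_value
     from   table(cast(:b as t_tmp_tab))'
    using l_rows;
end;
/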

Sorry for a really long post and I hope it makes sense.

My questions are:
What would be your recommended (fastest) approach to complete this task?
Which of my listed options do you think is best, and could they be improved in some way?
I am using the APPEND and PARALLEL hints to make things faster. Do you think that is good, or is there a better way? At the moment I am using PARALLEL without a number. Do you think I should use PARALLEL with a number, e.g. PARALLEL(8)?
Would committing during the migration speed up the process, because the redo would be smaller?

Your help is much appreciated.
Thank you.
Regards,
Branko

and Connor said...

OK, here are my thoughts - apologies for the delay, I've been at conferences all over the place :-)

1) I would get all my objects created

So I would query just t1/t2 and get my AUCKLAND & WELLINGTON tables etc all created and ready to go

2) Divide up the big table

Obviously we don't want to be repeatedly scanning that 2-billion-row table, but we will need to do it at least once.

So I'd do something like this:

create table COPY_OF_2BILLION
( tag varchar2(30), <the existing columns> )
partition by list ( tag )
(
partition p_auckland values ('AUCKLAND'),
partition p_wellington values ('WELLINGTON'),
partition p_hamilton values ('HAMILTON'),
partition p_teawamutu values ('TEAWAMUTU')
...
)

And now (without pivoting etc) we do

insert /*+ APPEND PARALLEL */ into COPY_OF_2BILLION
select /*+ PARALLEL */ t1.table_name, t4.*
from t1, t2, t3, t4
where <join conditions>

Now we have our 2 billion rows segmented into chunks appropriate for each target table.

3) Now for each table, we use the relevant partition to pivot/load, eg

insert /*+ APPEND */ into WELLINGTON
select t4.*
from t1, t2, t3, COPY_OF_2BILLION partition ( p_wellington ) t4
where <join conditions>
PIVOT (...)

I haven't done these in parallel, because I would have you control the parallelism yourself - namely, you can run WELLINGTON, AUCKLAND, etc in parallel.

All of which should be relatively easy to generate dynamically once into script files, and then run either serially or parallel etc etc.
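For example, a script-generation query along these lines (a sketch only - it assumes the target table name in T1 matches the partition naming above, and the column/pivot lists would still have to be spliced in from the XMLMETADATA in T2):

set heading off pagesize 0 feedback off
spool load_all.sql
select 'insert /*+ APPEND */ into ' || name ||
       ' select <pivoted columns> from COPY_OF_2BILLION partition ( p_' || lower(name) || ' )' ||
       ' pivot ( <pivot clause built from t2.xmlmetadata> );'
from   t1;
spool off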

How does that sound ?


Comments

Branko Pinter, December 15, 2015 - 9:25 pm UTC

Thank you for the reply.
In the meantime I developed option 4, which is very similar to your approach, but I was using a separate temp table for each city instead of one partitioned table. I didn't think of partitioning.
Great advice. Thank you.

DBMS_PARALLEL_EXECUTE unequal workload

Branko Pinter, December 21, 2015 - 2:33 am UTC

Hi,
I implemented your suggested solution and I have to say it's about 25% faster using a partitioned table. Many thanks for that.

I have another question now.
I am trying to execute the migration from partitioned table into flattened table in parallel using DBMS_PARALLEL_EXECUTE.

The issue I have is that, for example, when I use a parallel degree of 4, the workload is not split equally between the jobs. Is there a way to achieve a more even distribution of workload between the jobs?
The table is chunked by a unique number from table t2 using the procedure create_chunks_by_sql.
Executing the task as:
dbms_parallel_execute.run_task(l_task_name, l_sql_stmt, dbms_sql.native, parallel_level => 4).
So I have 350 chunks, but when I look at the report it looks like this:
job_name | time in sec | number of processed chunks
TASK$_11206_2 | 0.141264 | 3
TASK$_11206_3 | 8.070009 | 280
TASK$_11206_1 | 2.043907 | 59
TASK$_11206_4 | 0.445033 | 12

Why did job 3 execute 280 chunks and job 2 only 3?
If I run it again numbers will be different but not significantly.

Thank you for your help.
Connor McDonald
December 21, 2015 - 4:39 am UTC

What I meant by:

"I havent done these parallel, because I would control the parallelism yourself"

Is that I would use (for example) dbms_job and control the parallelism using distinct objects, eg

set job_queue_processes to (say) 8

submit a job for WELLINGTON
submit a job for AUCKLAND
submit a job for HAMILTON
etc etc etc

Or shell script will do the trick just as well
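A minimal sketch of that idea, assuming a hypothetical procedure load_city(p_name) that wraps the per-table pivot/insert shown earlier:

alter system set job_queue_processes = 8;

declare
  l_job binary_integer;
begin
  for r in (select name from t1) loop
    -- one job per target table; the jobs run concurrently,
    -- up to job_queue_processes at a time
    dbms_job.submit(l_job, 'begin load_city(''' || r.name || '''); end;');
  end loop;
  commit;   -- jobs only start once the submitting session commits
end;
/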




No partitioning

Branko Pinter, January 14, 2016 - 8:31 pm UTC

Hi,

I have another question.
In your response to my first question you recommended using partitioning to scan 2B records only once and insert them into partitioned table.
The problem I have is that the DBAs told me we don't have a partitioning license.
What would be your next best approach for point 2, given I can't use partitioning, while still maintaining performance?

Thank you.
Connor McDonald
January 15, 2016 - 2:26 am UTC

If you've got billions of rows... you should *get* a partitioning license :-)

But putting that discussion aside, you can resort to a "partition view", which was the predecessor to partitioning back in the version 7 days but still works now.

eg

create table tmp_AUCKLAND ( unpivoted_cols);
create table tmp_WELLINGTON( unpivoted_cols);
create table tmp_HAMILTON( unpivoted_cols);
etc

and load them all in 1 pass (or as many as you can - there are some limits here) using multi-table insert, ie

insert /*+ APPEND */ first
when tag = 'AUCKLAND' then
into tmp_auckland values ( ... )
when tag = 'WELLINGTON' then
into tmp_wellington values ( ... )
select /*+ PARALLEL */ t1.table_name, t4.*
from t1, t2, t3, t4
where <join conditions>

and then proceed from there.
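For completeness, the "partition view" trick mentioned above is roughly a UNION ALL view over the per-city tables, each carrying a check constraint on the tag column so the optimizer can skip branches that cannot match. A sketch, assuming each tmp_ table keeps the tag column selected in the multi-table insert:

alter table tmp_auckland   add constraint ck_auck check (tag = 'AUCKLAND');
alter table tmp_wellington add constraint ck_well check (tag = 'WELLINGTON');

create or replace view all_cities as
select * from tmp_auckland
union all
select * from tmp_wellington;
-- ... one branch per city

-- a query such as
--   select * from all_cities where tag = 'WELLINGTON'
-- can then skip the AUCKLAND branch via the check constraints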

Optimize for heavy data load

A reader, August 09, 2022 - 1:56 pm UTC

Greetings!

We are performing a data migration from a source table into a target table based on some logic. Around ~28 billion rows will be loaded into the target table.
The source table is partitioned daily on eff_dt_start.
The target table is partitioned weekly on eff_dt_start.

Data in the source table goes back to June 2007.

When we run this insert query, we noticed that it consumes a huge amount of temp space (7 TB).

1. Do you have any suggestions on how to optimize this insert to reduce the temp usage and make it efficient?

2. Would it help to process the insert in smaller chunks, perhaps a month or a year at a time, so that a smaller dataset is filtered and loaded each time?

If so, how do we process this query in smaller chunks?


Thanks.


Insert query:

===============


INSERT /*+ APPEND PARALLEL(32) */ INTO curs.f_cur_point_yield(cur_entity_id,
rate_side,
batch_id,
par_yield,
benchmark_yield,
zero_yield,
forward_rate,
discount_factor,
tenr_id,
tenr_date,
discount_margin,
z_spread,
rec_dt_start,
rec_dt_end,
eff_dt_start,
eff_dt_end,
tenr_name,
primary_milestone,
eds_orig,
rds_orig,
asset_swap_spread)
SELECT /*+ PARALLEL(32) */ cur_id,
rate_side,
batch_id,
par_yield,
benchmark_yield,
zero_yield,
forward_rate,
discount_factor,
tenr,
tenr_date,
discount_margin,
z_spread,
sys_extract_utc(SYSTIMESTAMP),
TO_TIMESTAMP('31-DEC-9999 00:00:00:000000', 'DD-Mon-RR HH24:MI:SS:FF'),
UTC_eff_dt_start,
UTC_eff_dt_start + INTERVAL '01' SECOND,
tenr_name,
'1600NY',
NULL,
NULL,
0
FROM (
SELECT /*+ PARALLEL(32) */ fcp.*, tn.tenr_short_desc AS tenr_NAME,
SYS_EXTRACT_UTC(from_tz(TO_TIMESTAMP(TO_CHAR(eff_dt_start, 'dd-MON-yyyy')) + INTERVAL '16' hour , 'America/New_York')) AS UTC_eff_dt_start,
ROW_NUMBER() OVER (PARTITION BY fcp.cur_id, fcp.tenr, TRUNC(fcp.eff_dt_start) ORDER BY fcp.eff_dt_start DESC) AS rn
FROM curhist.f_cur_point_yield fcp
INNER JOIN curs.tenr_names tn
ON tn.tenr_id = fcp.tenr
WHERE fcp.eff_dt_start < TIMESTAMP '2021-12-06 00:00:00'
AND fcp.eff_dt_end > sys_extract_utc(SYSTIMESTAMP)
AND fcp.rec_dt_end > sys_extract_utc(SYSTIMESTAMP)
)
WHERE rn = 1;

Connor McDonald
August 10, 2022 - 3:32 am UTC

ROW_NUMBER() OVER (PARTITION BY fcp.cur_id, fcp.tenr, TRUNC(fcp.eff_dt_start) ORDER BY fcp.eff_dt_start DESC) AS rn

means we have to work out a sort/sequence order for the data. That means sorting... which is most likely why you are seeing the temp segment usage.

Add in parallel, and now you have 32/64 processes all doing sorting and that's a LOT of temp work going on.

A lot of this will depend on the data - since you are (in the row_number function) partitioning by trunc(eff_dt_start), you could be doing this load 1 *target* partition at a time and still be ok in terms of avoiding the dups per day.

Thus you could do something like

INSERT /*+ APPEND  */ INTO curs.f_cur_point_yield partition ( partition_for_1st_week )
select ...
from
(
SELECT ...FROM curhist.f_cur_point_yield fcp
INNER JOIN curs.tenr_names tn
ON tn.tenr_id = fcp.tenr
WHERE fcp.eff_dt_start < TIMESTAMP '...'   -- pick these two to align with the target week
and fcp.eff_dt_start > TIMESTAMP '...'     -- above
AND fcp.eff_dt_end > sys_extract_utc(SYSTIMESTAMP)
AND fcp.rec_dt_end > sys_extract_utc(SYSTIMESTAMP)
)
WHERE rn = 1;
 


which means only a week of data to sort. Then in multiple sessions, you could have

INSERT /*+ APPEND  */ INTO curs.f_cur_point_yield partition ( partition_for_2nd_week )



INSERT /*+ APPEND  */ INTO curs.f_cur_point_yield partition ( partition_for_3rd_week )


and so forth, you are manually achieving parallel loading into each target partition

Hope that makes sense

Optimize for heavy data load

Rajeshwaran, Jeyabal, August 10, 2022 - 4:24 pm UTC

Also given this
ROW_NUMBER() OVER (PARTITION BY fcp.cur_id, fcp.tenr, TRUNC(fcp.eff_dt_start) ORDER BY fcp.eff_dt_start DESC) AS rn
FROM curhist.f_cur_point_yield fcp
INNER JOIN curs.tenr_names tn
ON tn.tenr_id = fcp.tenr
WHERE fcp.eff_dt_start < TIMESTAMP '2021-12-06 00:00:00'
AND fcp.eff_dt_end > sys_extract_utc(SYSTIMESTAMP)
AND fcp.rec_dt_end > sys_extract_utc(SYSTIMESTAMP)
)
WHERE rn = 1;

luckily, all the columns in the "partition by" and "order by" clauses of the analytic function come from the FCP table.

Rather than doing the join, applying the analytic, and then filtering on the analytic result,

can we do something like this: apply the analytic, filter on it, and then do the join on that reduced subset?

(Hopefully that subset is a smaller volume than the result set produced by the join.)

like this
select *
from (
  select *
  from (
   select fcp.*,
    ROW_NUMBER() OVER (PARTITION BY fcp.cur_id, fcp.tenr, 
     TRUNC(fcp.eff_dt_start) ORDER BY fcp.eff_dt_start DESC) AS rn 
   from curhist.f_cur_point_yield fcp  
    ) fcp 
  where rn = 1  
  ) fcp, curs.tenr_names tn
where tn.tenr_id = fcp.tenr
AND fcp.eff_dt_start < TIMESTAMP '2021-12-06 00:00:00'
AND fcp.eff_dt_end > sys_extract_utc(SYSTIMESTAMP)
AND fcp.rec_dt_end > sys_extract_utc(SYSTIMESTAMP)

Connor McDonald
August 11, 2022 - 4:30 am UTC

nice input

Optimize for heavy data load

A reader, August 16, 2022 - 3:09 am UTC

Thanks for your suggestion.

Looks like we may have to do this with smaller chunks of data between time ranges.

In order to do this, I believe we need to map the source partitions (daily) to the target partitions (weekly), store that mapping in a table,
and process it in a loop.

If possible, could you please show the PL/SQL logic or a procedure for generating the source/target partition mapping and driving this in a loop?

INSERT /*+ APPEND */ INTO curs.f_cur_point_yield partition ( partition_for_1st_week )
select ...
from
(
SELECT ...FROM curhist.f_cur_point_yield fcp
INNER JOIN curs.tenr_names tn
ON tn.tenr_id = fcp.tenr
WHERE fcp.eff_dt_start < TIMESTAMP '...' -- pick these two to align with the target week
and fcp.eff_dt_start > TIMESTAMP '...' -- above
AND fcp.eff_dt_end > sys_extract_utc(SYSTIMESTAMP)
AND fcp.rec_dt_end > sys_extract_utc(SYSTIMESTAMP)
)
WHERE rn = 1;


--
Thanks!
Connor McDonald
August 17, 2022 - 4:17 am UTC

You can do this to convert the LONG high value column to a string

select 
   x.table_name         
  ,x.partition_name     
  ,x.high_value         
  ,x.partition_position 
FROM  XMLTABLE('/ROWSET/ROW'
PASSING (
  SELECT DBMS_XMLGEN.GETXMLTYPE(
   'select  table_name,partition_name,high_value,partition_position from user_tab_partitions'
  ) FROM dual
)
COLUMNS 
   table_name         varchar2(128)
  ,partition_name     varchar2(128)
  ,high_value         varchar2(256)
  ,partition_position number
) x


Once you do that, check out this video which shows how to use that string in PLSQL to build date/timestamp expressions based on this high value



Armed with that - you now know the date ranges for each target partition which become inputs into your query for timestamp ranges.
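A rough sketch of how that driving loop could look. This is an illustration only: it reuses the XMLTABLE query above, assumes the target table name from this thread (use all_tab_partitions with an owner filter if it lives in another schema), and abbreviates the SELECT to the skeleton shown earlier:

declare
  l_hi timestamp;
  l_lo timestamp;
begin
  for p in (
    select x.partition_name, x.high_value
    from   xmltable('/ROWSET/ROW'
             passing (select dbms_xmlgen.getxmltype(
                        'select partition_name, high_value, partition_position
                           from user_tab_partitions
                          where table_name = ''F_CUR_POINT_YIELD''') from dual)
             columns partition_name     varchar2(128),
                     high_value         varchar2(256),
                     partition_position number) x
    order by x.partition_position )
  loop
    -- HIGH_VALUE holds text such as: TIMESTAMP' 2021-12-06 00:00:00'
    -- so evaluate it to get the (exclusive) upper bound of this weekly partition
    execute immediate 'select ' || p.high_value || ' from dual' into l_hi;
    l_lo := l_hi - interval '7' day;

    execute immediate
      'insert /*+ APPEND */ into curs.f_cur_point_yield partition (' || p.partition_name || ') '
      || 'select <the full select from the original statement> '
      || 'where fcp.eff_dt_start >= :lo and fcp.eff_dt_start < :hi'
      using l_lo, l_hi;
    commit;   -- a direct-path insert must be committed before the next one
  end loop;
end;
/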

Optimize for heavy data load

A reader, August 18, 2022 - 2:56 am UTC

Hi Connor

Thank you for sharing the video on the "reading a LONG column" technique in SQL. This is quite useful.

Thanks!
Connor McDonald
August 23, 2022 - 10:36 pm UTC

Glad we could help

partition key column name change with Data type change, data copy and application transparency

Tim, January 11, 2023 - 2:21 am UTC

Happy New Year!

Need your suggestion on this. Database is on 19.11 on Exadata platform.

1. We have a partitioned table (about 40 billion rows) which stores historical and current data from the year 2000 to the present, so about 22 years of data.

We have a requirement to rename the partition key column with a data type change:

Here, SNP_DATE is the current partition key column, of DATE datatype, and it will be renamed to EFF_DATE_START (of TIMESTAMP datatype). So we are setting up a new partitioned table with all the schema changes in place and then copying the data from the old table to the new one.

Question:

1. How do we perform the data migration seamlessly while the application is using the table? Would creating a materialized view log on the source table and incrementally refreshing the materialized view be the choice to consider here? On the day of cutover, we want to rename the new table to the old name. We do have the flexibility that historical data can be copied in stages from the old table to the new one.

2. I was thinking of DBMS_REDEFINITION, but with that approach I think we won't be able to copy in an incremental fashion, e.g. year-by-year partitions of data from source to target. It's either a complete data copy or nothing.

3. When the data is copied from a DATE column to a TIMESTAMP column, does Oracle implicitly perform the conversion, or do we need to use TO_TIMESTAMP(date_column)? Can you please show how to do this?

4. On the application side, all the SQL queries/data loads refer to the old column. Any idea what would be the best and least risky approach to make these changes transparent to application code/users? If that is not an option, then we need to change hundreds of queries on the application side.

One transparent approach is to use a VIEW (presenting the new column under the old column name) to the users. Is there any performance issue or drawback with this view approach, e.g. with predicate pushing?

Any thoughts? Thanks in advance for your suggestions/recommendations.
Connor McDonald
October 16, 2023 - 1:41 am UTC

There's an easy way to convert DATE to TIMESTAMP :-)

SQL> create table t ( x date, y int, z int );

Table T created.

SQL>
SQL> insert into t
  2  select date '2020-01-01' + rownum, rownum, rownum
  3  from dual
  4* connect by level <= 1000;

1,000 rows inserted.

SQL> commit;

Commit complete.

SQL>
SQL> alter table t modify x timestamp;

Table T altered.

Re: partition key column name change with Data type change, data copy and application transparency

Narendra, January 12, 2023 - 5:32 pm UTC

@Tim,

I am sure Chris/Connor would provide a better response to your queries but I found your questions interesting so decided to share my thoughts. You are right that there does not appear to be an "easy-and-quick" way to achieve this.
Are you planning to change the data type in such a way that existing data is retained "as is" but the application can/needs to populate future data with fractional seconds? If not, then you might want to consider using a virtual column of TIMESTAMP data type, and you can use the feature introduced in 18c that allows changing the partitioning column with DDL, even with the ONLINE option.
The only way I can think of achieving this is to add a new column of TIMESTAMP data type to the existing table; populate the new column from the existing column (this can be done in SQL in multiple batches, each batch targeting specific partitions, starting from the oldest and working to the latest, assuming old data is not being changed); and then use the online partition-modification option.
As long as the application does not need to modify historical data, this should be achievable with minimal disruption.
Using a view to present the new column under the name of the old column appears to be a sound (the only??) option.
If you have not already seen it, you might be interested in a similar discussion here from a couple of years ago:
https://asktom.oracle.com/pls/apex/asktom.search?tag=partition-by-range-with-timestamp-column#9544780700346803820

Hope this helps.
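As a rough illustration of the virtual-column idea above (table and partition names are made up, and whether online repartitioning is practical for a 40-billion-row table would of course need testing):

-- add a TIMESTAMP view of the existing DATE partition key as a virtual column
alter table big_t add (
  eff_date_start timestamp generated always as (cast(snp_date as timestamp)) virtual
);

-- 18c+: change the partitioning key online, without recreating the table
alter table big_t modify
  partition by range (eff_date_start)
  interval (numtodsinterval(7, 'DAY'))
  ( partition p_old values less than (timestamp '2000-01-08 00:00:00') )
  online;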
Connor McDonald
January 13, 2023 - 4:55 am UTC

Thanks for adding some input. More community contribution = a better community.

Re: partition key column name change with Data type change, data copy and application transparency

Narendra, January 16, 2023 - 12:04 am UTC

@Tim

One more point.

2. I was thinking of "DBMS_REDEF" but With DBMS_REDEFINITION approach, I think we won't able to copy in incremental fashion like year by year partitions of data from source to target table. Its either complete data copy or nothing.

Have you explored the option of using the PART_NAME parameter of the DBMS_REDEFINITION APIs? The documentation appears to suggest that one can use DBMS_REDEFINITION on selected partitions. I am not sure if there are any quirks in doing batched partition redefinition, but if it works as documented it might be the solution you are looking for. Of course, you will need to think about how to "present" these tables while the redefinition is "in progress", but I guess you already know that is a question you will need to answer eventually. Hope this helps (if you decide to revisit this).
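A skeletal example of the PART_NAME usage being referred to (purely illustrative - schema, table and partition names are invented, and whether partition-level redefinition fits the DATE-to-TIMESTAMP change would need testing):

begin
  -- redefine one source partition at a time into a matching interim table
  dbms_redefinition.start_redef_table(
    uname        => 'APP',
    orig_table   => 'BIG_T',
    int_table    => 'BIG_T_INT',
    options_flag => dbms_redefinition.cons_use_rowid,
    part_name    => 'P_2007_01');

  -- pick up changes made while the copy was running
  dbms_redefinition.sync_interim_table('APP', 'BIG_T', 'BIG_T_INT',
                                       part_name => 'P_2007_01');

  -- swap the partition and the interim table
  dbms_redefinition.finish_redef_table('APP', 'BIG_T', 'BIG_T_INT',
                                       part_name => 'P_2007_01');
end;
/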
