You Asked
Hi Tom,
I have a task to migrate data from 4 hierarchical tables into a few hundred flattened tables. I have developed some options, but I am looking for possible improvements or more effective approaches.
So I have 4 tables, let's say:
create table t1(id number, name varchar2(20));
create table t2(id number, t1_id number, xmlmetadata clob);
create table t3(id number, t2_id number);
create table t4(id number, t3_id number, name varchar2(40), value varchar2(4000));
insert into t1 values(10,'WELLINGTON');
insert into t1 values(11,'AUCKLAND');
insert into t2 values(100,10,'<Attribute>firstname</Attribute><Attribute>lastname</Attribute>');
insert into t2 values(101,10,'<Attribute>firstname</Attribute><Attribute>gender</Attribute>');
insert into t2 values(102,11,'<Attribute>firstname</Attribute><Attribute>salary</Attribute>');
insert into t3 values(1001,100);
insert into t3 values(1002,100);
insert into t3 values(1003,101);
insert into t3 values(1004,102);
insert into t4 values(5001,1001,'firstname','John');
insert into t4 values(5002,1001,'lastname','Smith');
insert into t4 values(5003,1002,'firstname','Mark');
insert into t4 values(5004,1002,'gender','M');
insert into t4 values(5005,1004,'firstname','Alex');
insert into t4 values(5006,1004,'salary','50000');
Table T1 is the parent of T2, T2 is the parent of T3, and T3 is the parent of T4. The tables have indexes and constraints; e.g. column t1_id in table t2 has a foreign key to column id in table t1, and so on for the other tables.
Number of records expected in production in each table:
T1 = 400 records, T2 = 500 records, T3 = 250 Million records, T4 = 2 Billion records.
Please note that I have tested this on tables of size:
T1 = 300, T2 = 350, T3 = 7M, T4 = 45M records.
Each record in T1 represents 1 flattened table. The columns of each table are defined in XMLMETADATA from T2.
Records from T3 and T4 are then joined together and inserted into the correct table according to the parent IDs.
The data in table T4 is in key-value format and will be transformed into a horizontal/flattened layout; e.g. multiple records from table T4 become 1 record in the new flattened table.
So the above example would result in 2 flattened tables:
Table WELLINGTON with columns "firstname", "lastname" and "gender" and table AUCKLAND with columns "firstname" and "salary".
Flattened data of table WELLINGTON would look like:
FIRSTNAME | LASTNAME | GENDER
John | Smith | null
Mark | null | M
Table AUCKLAND:
FIRSTNAME | SALARY
Alex | 50000
Long story short: I have to join all the tables, pivot the records from T4, and insert them into a new flattened table (FT). The name of the table is defined in T1, the columns are defined in T2, and the data comes from T3 and T4.
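To illustrate the core operation, here is a minimal sketch against the sample tables above, assuming the table WELLINGTON(firstname, lastname, gender) has already been created; the hard-coded IN list is exactly the part that has to be generated dynamically from T2:
-- flatten the key-value rows of T4 into one row per T3 record
INSERT INTO wellington (firstname, lastname, gender)
SELECT firstname, lastname, gender
FROM (SELECT t3.id grp, t4.name, t4.value
        FROM t1, t2, t3, t4
       WHERE t2.t1_id = t1.id
         AND t3.t2_id = t2.id
         AND t4.t3_id = t3.id
         AND t1.name = 'WELLINGTON')
PIVOT (MAX(value) FOR name IN ('firstname' AS firstname,
                               'lastname'  AS lastname,
                               'gender'    AS gender));
This produces the two WELLINGTON rows shown above: (John, Smith, null) and (Mark, null, M).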
I have developed the following solutions:
1) Option 1
For each record in T2 I created a dynamic SQL statement like:
INSERT into X SELECT Y from (select A from t1,t2,t3,t4 pivot ABC)
Please note that rules for pivoting are dynamic and are defined in table T2.
After each record in T2 I do a commit.
Here is one example of my dynamic SQL query, just to give an idea of what I am doing:
INSERT /*+ APPEND */ INTO MSDDET10_JN
  (evidenceid, evidencetype, evidencetypeversiondefid, versionno,
   STARTDATE_, ENDDATE_, CHILDORADDITIONALOCCUPANT_, GENDER_, DATEOFBIRTH_,
   TOTALDAYSINCARE_, OUTOFWEEKS_, CSOIND_, TYPEOFPAYMENTORTYPEOFCHILD_,
   jn_seq, jn_flag, jn_datetime)
SELECT /*+ PARALLEL */ t2.*,
       iap_seq.NEXTVAL jn_seq,
       'A' jn_flag,
       to_date('30/10/2015 16:03:09', 'dd/mm/yyyy hh24:mi:ss') jn_datetime
  FROM (SELECT t1.*
          FROM (SELECT deda.evidenceid,
                       etd.evidencetypecode evidencetype,
                       ded.evidencetypeversiondefid,
                       ded.versionno,
                       deda.name,
                       deda.VALUE
                  FROM dynamicevidencedata ded,
                       dynamicevidencedataattribute deda,
                       evidencetypedef etd,
                       evidencetypeversiondef etvd
                 WHERE ded.evidenceid = deda.evidenceid
                   AND ded.evidencetypeversiondefid = etvd.evidencetypeversiondefid
                   AND etvd.evidencetypedefid = etd.evidencetypedefid
                   AND ded.evidencetypeversiondefid = 47570)
               PIVOT (MAX(VALUE)
                      FOR name IN ('startDate' STARTDATE_,
                                   'endDate' ENDDATE_,
                                   'childOrAdditionalOccupant' CHILDORADDITIONALOCCUPANT_,
                                   'gender' GENDER_,
                                   'dateOfBirth' DATEOFBIRTH_,
                                   'totalDaysInCare' TOTALDAYSINCARE_,
                                   'outOfWeeks' OUTOFWEEKS_,
                                   'csoInd' CSOIND_,
                                   'typeOfPaymentOrTypeOfChild' TYPEOFPAYMENTORTYPEOFCHILD_)) t1) t2
This approach took about 55 mins to transform/migrate my testing data (T4 = 45M records).
My concern with this approach is that in production, where there will be 2B+ records, I have to join all 4 tables 500 times, and I am afraid this is going to take ages even with indexes.
Even though the tables have indexes, the optimizer will sometimes use an index when the result set is small and do a full table scan when the result set is big. Please correct me if I am wrong.
2) Option 2:
I tried to split these 4 tables into 350 temp tables, so that each temp table would hold all the data that needs to be inserted into one flattened table. In the second step I read each temp table, do the pivot, and insert into the flattened tables.
Steps:
- using INSERT FIRST WHEN t1.name = 'AUCKLAND' THEN INTO tmp_auckland ... WHEN t1.name = 'WELLINGTON' THEN INTO tmp_wellington ... and so on, SELECT a, b, c FROM t1, t2, t3, t4 WHERE <join condition> (see the sketch after the table definitions below).
- then I read each record in T2 to get the table structure (to be able to create dynamic SQL with a pivot) and insert the data from the temp tables into the flattened tables. I create a dynamic statement similar to option 1: insert into x select A from (pivot temp tables). I don't have to do any joins here because I already prepared all the data I need in step 1.
All temp tables will have the same structure:
create table tmp_wellington (t1_name varchar2(30), t2_id number, t4_name varchar2(50), t4_value varchar2(4000))
create table tmp_auckland (t1_name varchar2(30), t2_id number, t4_name varchar2(50), t4_value varchar2(4000))
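For the sample schema, the step 1 statement would look roughly like this (a sketch; in reality one WHEN branch is generated per T1 record, up to the statement's limit):
INSERT FIRST
  WHEN t1_name = 'WELLINGTON' THEN
    INTO tmp_wellington (t1_name, t2_id, t4_name, t4_value)
    VALUES (t1_name, t2_id, t4_name, t4_value)
  WHEN t1_name = 'AUCKLAND' THEN
    INTO tmp_auckland (t1_name, t2_id, t4_name, t4_value)
    VALUES (t1_name, t2_id, t4_name, t4_value)
  -- ... more WHEN branches, one per T1 record ...
SELECT /*+ PARALLEL */ t1.name t1_name, t2.id t2_id, t4.name t4_name, t4.value t4_value
  FROM t1, t2, t3, t4
 WHERE t2.t1_id = t1.id
   AND t3.t2_id = t2.id
   AND t4.t3_id = t3.id;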
My intention here was to join and full scan the 4 main tables just once and split them into smaller sets of data, so that in the second step, when reading from the temp tables, I would full scan only (more or less) the records that I need (except when we have 2 records in T2 for one T1), and I can use the PARALLEL hint.
The problem I encountered is that the multi-table INSERT FIRST WHEN x THEN INTO a WHEN y THEN INTO b ... statement has a limit: I could insert into only about 120 tables (6 columns each) with 1 statement. Because of that I still need to full scan t1, t2, t3, t4 3 times (120*3 = 360, for the 350 records in T2).
Each full scan of the joined tables took about 8 mins, so 24 mins in total to insert into the temp tables, and then an additional 16 mins to insert from the temp tables into the flattened tables.
So comparing the two options: option 1 took 55 mins and option 2 took 24+16 = 40 mins. Option 2 is faster, but I wonder if I could improve the performance further.
3) Option 3: does not work completely, or at least is not as dynamic as I expect it to be.
My 3rd option was to use option 2 but with a different approach to loading the temp tables. I was trying to avoid doing the full scan 3 times and instead do it once, process the records in PL/SQL, and insert them into the temp tables using BULK COLLECT and FORALL once the write limit (variable l_write_limit in the code below) is reached for a particular table.
Please note that I wanted to use FORALL with a dynamic table name, but I could not get it to work. If I use the commented-out FORALL (see the code below), I get the error "PLS-00457: expressions have to be of SQL types". A possible workaround is sketched after the code.
Just to test the performance, I manually created all the FORALL statements with IF conditions.
The performance of this one was slightly better: where the 3 full scans in option 2 took 24 mins, the approach in option 3 took 16 mins. So option 3 would take approximately 16 (loading into temp tables) + 16 (temp tables to flattened tables) = 32 mins in total, which is still better than options 1 and 2.
Example for option 3, loading data into temp tables only:
DECLARE
CURSOR c_evid
IS
SELECT t1.name t1_name, t2.id t2_id, t4.name t4_name, t4.VALUE t4_value
FROM t1,t2,t3,t4
WHERE t1.id = t2.t1_id
AND t2.id = t3.t2_id
AND t3.id = t4.t3_id;
CURSOR c_tbl_struct
IS
SELECT t1.name t1_name, t2.id t2_id
FROM t1,t2
WHERE t1.id = t2.t1_id;
TYPE t_evid IS TABLE OF c_evid%ROWTYPE
INDEX BY PLS_INTEGER;
TYPE tr_data IS RECORD(rec_cnt NUMBER
,t1_name VARCHAR2(30)
,tbl_row t_evid);
TYPE taa_data IS TABLE OF tr_data
INDEX BY PLS_INTEGER;
--
l_from t_evid;
l_to taa_data;
k NUMBER;
l_etvd_idx NUMBER;
l_rec_cnt NUMBER;
l_fetch_limit NUMBER := 1000;
l_write_limit NUMBER := 500;
l_last_row NUMBER;
l_tmp_cnt NUMBER := 0;
BEGIN
dbms_output.put_line('------------------------------------------');
dbms_output.put_line('Copying started at ' || TO_CHAR(SYSDATE, 'hh24:mi:ss'));
-- initialize collection
FOR r_tbl_struct IN c_tbl_struct
LOOP
l_to(r_tbl_struct.t2_id).rec_cnt := 0;
l_to(r_tbl_struct.t2_id).t1_name := r_tbl_struct.t1_name;
END LOOP;
-- start reading data
OPEN c_evid;
LOOP
FETCH c_evid
BULK COLLECT INTO l_from
LIMIT l_fetch_limit;
dbms_output.put_line('Fetched ' || l_from.COUNT);
EXIT WHEN l_from.COUNT = 0;
-- read data from cursor into collections
FOR i IN l_from.FIRST .. l_from.LAST
LOOP
l_etvd_idx := l_from(i).t2_id;
l_rec_cnt := l_to(l_etvd_idx).rec_cnt + 1;
l_to(l_etvd_idx).tbl_row(l_rec_cnt) := l_from(i);
l_to(l_etvd_idx).rec_cnt := l_rec_cnt;
IF l_rec_cnt >= l_write_limit THEN
dbms_output.put_line(
'Ready to write. l_etvd = ' || l_etvd_idx || ' cnt = ' || l_rec_cnt || ', etc = ' || l_to(l_etvd_idx).t1_name);
IF l_to(l_etvd_idx).t1_name = 'WELLINGTON' then
FORALL j IN INDICES OF l_to(l_etvd_idx).tbl_row
INSERT INTO tmp_wellington -- this should be dynamic
VALUES l_to(l_etvd_idx).tbl_row(j);
ELSIF l_to(l_etvd_idx).t1_name = 'AUCKLAND' then
FORALL j IN INDICES OF l_to(l_etvd_idx).tbl_row
INSERT INTO tmp_auckland -- this should be dynamic
VALUES l_to(l_etvd_idx).tbl_row(j);
END IF;
--FORALL j IN INDICES OF l_to(l_etvd_idx).tbl_row
--  EXECUTE IMMEDIATE 'INSERT INTO '||l_to(l_etvd_idx).t1_name||' VALUES :1'
--  USING l_to(l_etvd_idx).tbl_row(j);  -- fails: PLS-00457, records are not SQL types
l_tmp_cnt := l_tmp_cnt + l_rec_cnt;
l_to(l_etvd_idx).rec_cnt := 0;
l_to(l_etvd_idx).tbl_row.delete;
END IF;
END LOOP;
END LOOP;
CLOSE c_evid;
k := l_to.FIRST;
WHILE k IS NOT NULL
LOOP
l_tmp_cnt := l_tmp_cnt + l_to(k).rec_cnt;
IF l_to(k).rec_cnt > 0 THEN
dbms_output.put_line(
'Ready to write - last. etvd_id = ' || k || ', etc = ' || l_to(k).t1_name || ', rec_cnt = ' || l_to(k).rec_cnt);
-- here has to be forall for all the plsql structures that still have records in it
IF l_to(k).t1_name = 'WELLINGTON' then
FORALL j IN INDICES OF l_to(k).tbl_row
INSERT INTO tmp_wellington -- this should be dynamic
VALUES l_to(k).tbl_row(j);
ELSIF l_to(k).t1_name = 'AUCKLAND' then
FORALL j IN INDICES OF l_to(k).tbl_row
INSERT INTO tmp_auckland -- this should be dynamic
VALUES l_to(k).tbl_row(j);
END IF;
END IF;
k := l_to.NEXT(k);
END LOOP;
dbms_output.put_line('All records in plsql structures: ' || l_tmp_cnt);
END;
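For reference, here is a sketch of a possible workaround for the PLS-00457 error (an assumption on my side, not tested at scale): only SQL types can be bound into dynamic SQL, so if the row type is declared at SQL level with CREATE TYPE, the whole collection can be bound through the TABLE() operator and no FORALL is needed at all. The names tmp_row_t, tmp_tab_t, l_rows and l_table_name below are hypothetical:
create type tmp_row_t as object
  (t1_name varchar2(30), t2_id number, t4_name varchar2(50), t4_value varchar2(4000));
/
create type tmp_tab_t as table of tmp_row_t;
/
-- inside the block: l_rows tmp_tab_t holds the buffered rows for one table,
-- l_table_name is that table's name; one dynamic statement flushes them all
EXECUTE IMMEDIATE
  'insert into ' || l_table_name ||
  ' select t1_name, t2_id, t4_name, t4_value from table(:rows)'
  USING l_rows;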
Sorry for a really long post and I hope it makes sense.
My questions are:
What would be your recommended (fastest) approach to complete this task?
Which of my listed options do you think is the best, and could they be improved in some way?
I am using the APPEND and PARALLEL hints to make things faster. Do you think that is good, or is there a better way? At the moment I am using PARALLEL without a number. Do you think I should use PARALLEL with a number, e.g. PARALLEL(8)?
Would commits during the migration speed up the process, because the redo logs would be smaller?
Your help is much appreciated.
Thank you.
Regards,
Branko
and Connor said...
OK, here are my thoughts - apologies for the delay, I've been at conferences all over the place :-)
1) I would get all my objects created
So I would query just t1/t2 and get my AUCKLAND & WELLINGTON tables etc all created and ready to go, along the lines of the sketch below.
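A rough sketch of that generation against the sample schema (my assumptions: the XMLMETADATA fragment is wrapped in a dummy root element so it parses as XML, and every column is simply created as VARCHAR2(4000)):
begin
  for r in (select tab_name,
                   listagg(attr || ' varchar2(4000)', ', ')
                     within group (order by attr) cols
              from (-- one row per distinct attribute per target table,
                    -- i.e. the union of attributes across all T2 rows for a T1
                    select distinct t1.name tab_name, x.attr
                      from t1, t2,
                           xmltable('/r/Attribute'
                             passing xmltype('<r>' || t2.xmlmetadata || '</r>')
                             columns attr varchar2(30) path '.') x
                     where t2.t1_id = t1.id)
             group by tab_name)
  loop
    execute immediate 'create table ' || r.tab_name || ' (' || r.cols || ')';
  end loop;
end;
/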
2) Divide up the big table
Obviously we don't want to be repeatedly scanning that 2 billion row table, but we'll need to do it at least once.
So I'd do something like this:
create table COPY_OF_2BILLION
( tag varchar2(30), <the existing columns> )
partition by list ( tag )
(
partition p_auckland values ('AUCKLAND'),
partition p_wellington values ('WELLINGTON'),
partition p_hamilton values ('HAMILTON'),
partition p_teawamutu values ('TEAWAMUTU')
...
)
And now (without pivoting etc) we do
insert /*+ APPEND PARALLEL */ into COPY_OF_2BILLION
select /*+ PARALLEL */ t1.name, t4.*
from t1, t2, t3, t4
where <join conditions>
Now we have our 2 billion rows segmented out into chunks appropriate for each table
3) Now for each table, we use the relevant partition to pivot/load, eg
insert /*+ APPEND */ into WELLINGTON
select t4.*
from t1, t2, t3, COPY_OF_2BILLION partition (p_wellington) t4
where <join conditions>
PIVOT (...)
I haven't done these in parallel, because I would have you control the parallelism yourself: namely, you can run the WELLINGTON, AUCKLAND, etc loads concurrently.
All of which should be relatively easy to generate dynamically into script files, and then run either serially or in parallel, along the lines of the sketch below.
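A sketch of generating those per-table INSERT statements from the sample schema's metadata (again an assumption on my part: the same dummy-root wrapping as in step 1, and MAX() as the pivot aggregate; the output would be spooled into script files):
set serveroutput on
declare
  l_cols   varchar2(4000);  -- column list, e.g. firstname, lastname, gender
  l_inlist varchar2(4000);  -- pivot IN list, e.g. 'firstname' as firstname, ...
begin
  for r in (select id, name from t1) loop
    select listagg(attr, ', ') within group (order by attr),
           listagg('''' || attr || ''' as ' || attr, ', ') within group (order by attr)
      into l_cols, l_inlist
      from (select distinct x.attr
              from t2,
                   xmltable('/r/Attribute'
                     passing xmltype('<r>' || t2.xmlmetadata || '</r>')
                     columns attr varchar2(30) path '.') x
             where t2.t1_id = r.id);
    dbms_output.put_line(
      'insert /*+ APPEND */ into ' || r.name || ' (' || l_cols || ')' || chr(10) ||
      'select ' || l_cols || chr(10) ||
      'from (select t3_id, name, value from COPY_OF_2BILLION partition (p_' ||
      lower(r.name) || '))' || chr(10) ||
      'pivot (max(value) for name in (' || l_inlist || '));');
  end loop;
end;
/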
How does that sound?