You Asked

Hi Tom,
We have table with 450 million rows in it. As one time activity, we copy it in lower environment and mask some of its columns (9 varchar2 columns). Masking happens through function call, it's logic cannot be changed as it is used across many heterogeneous systems. What is the best way to achieve this?

Example of UPDATE stmt:
UPDATE T
SET COL1=PKG.MASKFUNC(COL1),
COL2=PKG.MASKFUNC(COL2),
COL3=PKG.MASKFUNC(COL3),
....
Where PK BETWEEN STARTING_VALUE AND END_VALUE;
COMMIT;

This runs in loop until all rows are updated. Around 10,000 rows in single update. This took 4 days last time & want to reduce run time.

More on column data:
6 of the columns are of varchar2(400)
2 columns are of varchar2(500)
One column is varchar2(1000). Data in these columns are non-unique and may contain nulls.

Approach tried:
Mask distinct values of each column and insert into separate tables.
E.g.
INSERT /*+APPEND*/INTO MSK_COL1
WITH TEMP AS (SELECT /*+ PARALLEL(8) */ COL1 FROM T WHERE COL1 IS NOT NULL GROUP BY COL1)
SELECT COL1,MASKFUNC(COL1) AS MSKCOL1 FROM TEMP;

Similar stmts for remaining 8 columns.

CREATE UNIQUE INDEX IDX1 ON MSK_COL1 (COL1);
Exec dbms_stats.gather_table_stats(user,'T',estimate_percent=>1);
Similar stmts for remaining 8 tables.

Final insert:
INSERT /*+ APPEND */ INTO T1
SELECT /*+PARALLEL(8)*/
....
,CASE WHEN COL1 IS NOT NULL THEN
(SELECT MSKCOL1 FROM MSK_COL1 WHERE A.COL1=COL1)
ELSE NULL
END
,....similar for remaining 8 columns
...
FROM T A;

Idea here is to rename table T to T_old and T1 to T.
Tested this approach on 10 million rows, final insert itself took 72 minutes, simple extrapolation says 54 hours.
No index exists on columns being masked on table T. Table T is partitioned on Date, interval one day. T has 117 columns in it.

Please suggest.

and Connor said...

Have you tried DBMS_PARALLEL_EXECUTE ?

There's an example here

https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:4248554900346593542

Also, do some tracing on a subset of the data - see *where* the time is being consumed. If you are burning lots of CPU in "MASKFUNC", then perhaps you need to focus attention there (or see if there is any benefit in using result_cache etc).

Rating

(3 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Suggestion

Ghassan, June 11, 2016 - 6:59 am UTC

Make the function deterministic
Use the result cache hint and a parallel hint if possible in the select stmt.
Create table as select ..fct (),..,fct ()

Otherwise Merge stmt could be a solution with the rowid hint for target also by using parallelism

Also if partitioned table try prunning and splitting by sets. ..

logic inside this function PKG.MASKFUNC

Rajeshwaran, Jeyabal, June 11, 2016 - 2:41 pm UTC

What sort of logic you have inside this function "PKG.MASKFUNC" ?

is that something a simple sql sitting inside that function, that inturn do look-up on Dimension Tables?

A reader, June 16, 2016 - 4:45 pm UTC

Explored dbms_parallel_execute along with result cache on mask function. Observed latch free wait events with latching on result cache. However, I figured out from the data that out of nine columns I mentioned in previous post, only one column needs to be masked and rest can be derived from this one column. As a result, I will be running mask function only once per row against previous logic of running it nine times per row. Was able to finish masking on 86 million rows in 85 mins.

Thanks for your inputs.

June 17, 2016 - 2:46 am UTC

Glad it worked out well

More to Explore

Hints

Need more information on hints? Check out the complete hint descriptions

Breadcrumb

Question and Answer

You Asked

and Connor said...

Rating

Comments

Suggestion

logic inside this function PKG.MASKFUNC

More to Explore

Hints

Update on table with 450 million rows

Breadcrumb

Question and Answer

You Asked

and Connor said...

Rating

Comments

Suggestion

logic inside this function PKG.MASKFUNC

More to Explore

Hints