Home>Question Details



Gleisson -- Thanks for the question regarding "Data Masking", version 9.2.0.1.0

Submitted on 2-Jan-2006 23:17 Central time zone
Last updated 11-Nov-2009 12:53

You Asked

Does Oracle provide a package or function for data masking ? For example, in a 
development environment, for data protection purposes, the information of the table 
customer needs to be masked. 

create table customer (last_name varchar2(25), first_name varchar2(25), address 
varchar2(25));
insert into customer values('Doe', 'John', '10 someroad st'); 

In the production environment, if we do a select on customer, the result would be without 
any surprise: 

select *
from customer;

LAST_NAME      FIRST_NAME     ADDRESS
-------------- -------------- --------------------
Doe            John           10 someroad st

In the development environment, the same information needs to be masked for something 
like:

LAST_NAME      FIRST_NAME     ADDRESS
-------------- -------------- --------------------
Ahd            Uiea           55 xxxx ue

It doesn't really need to make sense, it doesn't need to be readable. Just needs to be 
masked. Does Oracle supply anything for that purpose at all ? I tried to find but didn't 
have much success. I found a third party software that will do the job. If you want to, I 
can post its name here. 

Thanks Tom and happy 2006.

Gleisson Henrique 

and we said...

No, we can "hide" the information (Oracle 9i provides for row hiding and Oracle 10g 
provided for row and/or column hiding)

But data obfuscation - no, it cannot do that.  You actually need to physically obscure 
the data itself prior to storage (and it had better be a one way obfuscation).  Think 
about things like indexes and such.

Don't forget, as soon as you start obscuring data, you start to loose the natural 
distribution of data.  Do not be surprised when your obfuscated system behaves entirely 
differently from your "clear text" system - clustering factors, data lengths, 
distributions of values - all will be radically different.


I'll post to see if anyone has a suggestion on a tool to use sparingly on names/addresses 
and the like. 

Reviews    
3 stars Data Masking   January 3, 2006 - 8am Central time zone
Reviewer: David from UK
A most excellent tool for data masking may be found at the following :

www.datamasker.com

This software provides a simple repeatable method of scrambling data. 


5 stars Thanks Tom.   January 3, 2006 - 9am Central time zone
Reviewer: Gleisson Henrique from Washington, DC USA
I will look in the documentation for the row hiding in 9i that you mentioned to see how it works 
and what it does. I found this link that might work for my purposes.

http://asktom.oracle.com/pls/ask/f?p=4950:8:12084722728803631128::NO::F4950_P8_DISPLAYID,F4950_P8_CR
ITERIA:6382864029082
Analytics really *rock* but I agree with Robert on that post, Analytics are hard to read. But the 
way, I saw on some  page on AskTom that a reader suggested that you should publish a book on 
Analytics. That's actually a good idea. What do you say ? 

I didn't want to advertise on the AskTom, but since you said you would post to see if anyone knows 
any tool that might do that, I found one last night after I posted the question, and it's called 
The Data Masker. I watched the tutorial and it looks like it works pretty well.  


Followup   January 3, 2006 - 9am Central time zone:

That would kill performance (using analytics to shift data), you want to obscure it - physically 
scramble it - for testing purposes.

the row/column hiding is Fine Grained Access Control. And I don't think it applies to this problem.

And I don't mind "refering to software that solves a problem" on here at all. 

5 stars That was quick ! ! ! ! ! !   January 3, 2006 - 9am Central time zone
Reviewer: Gleisson Henrique from Washington, DC USA
Yeah, I started to look into row hiding and is not really what I thought it was.  


5 stars   January 3, 2006 - 9am Central time zone
Reviewer: Rajiv from VA, US
Why not write a couple of pl/sql functions using dbms_random, to just scramble the data while 
loading to development (either through sqlldr or select insert over dblink/whatever..) Thats pretty 
much what the tool also would do behind the scenes? 


5 stars   January 3, 2006 - 9am Central time zone
Reviewer: Jon 
> A most excellent tool for data masking may be found at 
> the following :

> www.datamasker.com

> This software provides a simple repeatable method of 
> scrambling data. 

Do you know what the price is?  

Does it install packages in Oracle and perform the data masking there? 


3 stars Shift columns in PL/SQL table   January 3, 2006 - 12pm Central time zone
Reviewer: Dejan from Slovenia
Hi,

We have the same requirement of scrambling test data and we did the following:

 - Bulk load "customer data" into PL/SQL table
 - in PL/SQL table shift column "last_name" 1 up, 
   column first_name 1 down ..
 - Update original customer table ..

Regards Dejan


    


3 stars DBMS_Obfuscation_Toolkit   January 4, 2006 - 5am Central time zone
Reviewer: Will from UK
Hi Tom, 

I used the Oracle obfuscation package successfully before after reading your Expert One-on-One 
book.  select/insert/update operations were all carried out using stored procedures which made 
implementation easy.  I just wondered if you had a reason for not suggesting it this time?

Would it be possible to combine this with FGAC?  If so could we then store the data in its 
naturally readable form, but obfuscate it, maybe with an obfuscated view, and depending on the 
querying user?  

Best regards
Will 


Followup   January 4, 2006 - 9am Central time zone:

data distributions.... (it is not an easy problem)
data widths change....

This data masking for testing is not a "trivial problem" to solve. 

5 stars Book on Analytics   January 4, 2006 - 10am Central time zone
Reviewer: Michael T. from Dallas, Tx USA
I know on a couple of threads you have been asked about, or mentioned thinking about writing a book 
on analytics.  I think a Cookbook style book, e.g. Perl Cookbook, on analytics would be terrific.  
It would be chock full of examples and each example would be explained, so even if a reader 
couldn't find an exact match they would most likely be able to use one the examples as a starting 
point for their solution. 


3 stars Book on analytics   January 4, 2006 - 11am Central time zone
Reviewer: Jonathan Gennick from Munising, Michigan, USA
A couple of people have mentioned a desire for a book on analytics. I hope I'm not too out of line 
mentioning this, but Anthony Molinaro uses analytics extensively in his just-published SQL 
Cookbook. I was at times worried that he leaned on them too heavily (I was his editor). Anyway, if 
you want a source of examples, that book is a good possibility.  


3 stars Are views an option?   January 4, 2006 - 12pm Central time zone
Reviewer: MarkB from Birmingham, England
Could you not mask the required columns behind a view?

Columns that are not to be shown can either by totally omitted or replaced with something like
'DETAILS NOT ALLOWED - CONTACT HELPDESK'

or masked out 

'****'||SUBSTR(customer.address,LENGTH(customer.address)-3)

However, this still leaves the "base table" visible, so permissions would need to be changed 
appropriately. 


Followup   January 4, 2006 - 1pm Central time zone:

data distribution problems.
indexes are not used problems..
updates, inserts - problems...

if the goal was to test - this would not permit a useful test. 

5 stars Police data   January 5, 2006 - 8am Central time zone
Reviewer: Connor from Mandurah
We have a similar problem (we record police criminal records and want to populate our test systems 
with realistic data). 

You need to find a balance between "masking" and "realism".  For example, we maintain a table which 
contains all column names with a tag of: "do nothing", "wipe", "scramble" where
- wipe means null out the data (eg images etc)
- do nothing means leave the data untouched (typically dates, numeric columns etc)
- scramble means mask the data

We also have some smarts in there about constraints and index maintenance options which I won't 
bore you with.

Our masking algorithm has the characteristics of:

- retain the first character (to keep some semblance of the data distribution)
- case is not retained (except for first character)
- length is retained

with the scrambling done with an old fashioned substitution cipher, with a different substitution 
array for each of the possible leading characters of the source column.

Now that of course makes things a lot more hackable, (eg, "Smith" is a popular name so (eg) "Sxy@f" 
in the masked data probably means "smith".  But we're not trying to make our data *impossible* to 
decode, we're simply trying to make sure our development/testing teams don't "stumble upon" 
information they were not meant to see.  Many of them have legitimate access to the unscrambled 
production data anyway (that access is obviously more finely audited).  Ultimately, if you don't 
trust your developers, then you're dead in the water anyway - they could simply easter egg the code 
to get whatever production data they needed.

Also, the fact that the cipher is 1-to-1 gives benefits that are probably worth the loss of 
masking.  For example, if a user has a problem when they so (say) an operation on: "Johnson", we 
can let the test team know to attempt that same operation in our test environment on (say): 
"Jr8w8dhq" to see if they can replicate the problem.  The moment you go with some more "hard core" 
scrambling, you will lose that facility.

Another plus for a simple cipher method is because its also a balance between the amount of 
scrambling you do and the resources required to actually scramble that data.  With substitution, we 
can build a 2-dimensional array package variable which contains all the substitutions and simply 
issue:

update table
set col1 = translate(col1, the_standard_ascii_table,
                           the_substition_table(ascii(col1))

so the scrambling is down with a single SQL rather than a mish-mash of PL/SQL and unload/reload 
operations.

hth
Connor 


5 stars Don't scramble - mix'n'match   February 14, 2006 - 5pm Central time zone
Reviewer: Isaac Blank from San Francisco, CA, USA
We once used a following method of obfuscating personal data:  We would randomly swap column values 
between different rows, so if you have John Brown, living on 123 Main St., Columbus, OH and Jane 
Dow from 567 Broadway, Manahattan, NY , they become John Doe, 567 Broadway, Columbus, NY and Jane 
Brown, 123 Main St., Manhattan, OH.  That way data distribution for individual columns stays 
intact.  Composite indexes are, of course, affected, but it gave us a pretty good database to play 
with 


3 stars   August 8, 2006 - 8am Central time zone
Reviewer: A reader 


4 stars Looking for mix'n'match solution   September 26, 2006 - 3pm Central time zone
Reviewer: siddu from Los Angeles, CA USA
We are looking at a mix'n'match solution, like the one posted by Isaac Blank above. What is the 
best way to achieve this? We tested using the cursor logic, and it is very slow. We have over 20 
million Customer records. I appreciate any help on this.  


Followup   September 26, 2006 - 5pm Central time zone:

ok, it does seem that Isaac therefore told you what to do?  what more do you need to get started?

this is not an algorithm I really want to play with "too much".  scrambling data is "hard", 
requires some thought about YOUR data and what makes sense.

so, what are you scrambling here? and why? and what does it look like? 

5 stars Re: Looking for mix'n'match solution   September 27, 2006 - 12am Central time zone
Reviewer: Siddu from Los Angeles, CA USA
I'm looking for some logic or SQL to shuffle the Names and addresses across the rows with better 
performance than using cursor. We want to scramble the Names, Addresses, SSN, DOB, Phone Numbers 
and Email Addresses. They all are stored in different tables. We do not want any sensitive data to 
be available outside the production systems. At the same time we want to have meaningful data 
available for testing and other needs. 

We want to shuffle the data randomly in Name and Address tables. We want to scramble the SSN, Phone 
Number and Email addresses. Add few days to date fields. 
 


Followup   September 27, 2006 - 4am Central time zone:

it is going to be slow regardless of the approach here, think of the work involved for a table of 
any size.

say the table has 5 columns you want to "scramble", we'd have to sort that table randomly 5 times 
in order to scramble it up.  Trying to do something in a bulk sql statement is going to be hard - 
since you cannot predicably get functions to be called over and over and over

so, say you wanted to "scramble" the owner and object_name column from all_objects, you might be 
able to do something like this - but it will require hitting all_objects three times into temp and 
then joining the results:

create table t
as
with main_data
as
(
select  rownum r,
        object_id,
        subobject_name,
        data_object_id,
        object_type,
        created
   from all_objects
),
new_object_names
as
(select object_name,
        row_number() over (order by dbms_random.random) r
   from all_objects
),
new_owners
as
(select owner,
        row_number() over (order by dbms_random.random) r
   from all_objects
)
select object_id, subobject_name,
       data_object_id, object_type,
       created, object_name, owner
  from main_data a, new_object_names b, new_owners c
 where a.r = b.r and a.r = c.r
/
 

5 stars   December 4, 2007 - 3am Central time zone
Reviewer: karthick from India
Data masking cannot be generic. It must be application specific. The tool that masks the data must 
know the meaning of those data that it¿s masking. 

When you say you want to give your production data which is masked for testing and you want your 
application to be tested with that data. That data must be a valid data it should satisfy all the 
rules that the application applies on the data.

The masker should know the nature of the information that it¿s handling. 

It should identify all constrains on the data and the masked information should satisfy them. By 
constraints I don¿t mean database constraints alone. Application level constraints. Even though 
constraints should exist in the data base this world has the dirty habit of putting it in different 
places. 

What if there is a front end restriction that salary cannot be > 1000 for manager and your masked 
data has salary 2000 for manager. How is this going to be handled?

Think about indexes what if there is a unique index in a column you should ensure that the masked 
data is unique. What if you have a function based index on a column what way it will get affected 
when you mask the data used by that index.

So when it¿s masking you are not masking just the data in your database but the data in the 
application. Thought at application level is very important for this.

As tom pointed out some technical problem in this, we actually have bigger functional and logical 
problems too


3 stars Masking different sources with same algorithm ?   October 28, 2009 - 10am Central time zone
Reviewer: A reader 
Hi - We are right now designing a 11g data warehouse database for a BI application. This database 
is going to have data loaded from different sources. And a lot of them will have the same 
information for a person like SSN, Name, etc. We have a masking requirement but the trick here is 
that for a person's SSN, it should be the same value across all sources for eg., if a person's SSN 
is 012345678 then the resultant masked value should be 555667777 from all sources. For a lot of the 
algorithms, they are not repeatable in the sense, if I run it once on 012345678 it will give 
555667777 but another time in another source it will give 999883333. Then in the data warehouse 
final database it might seem that these people are different. Has anyone come across this before ? 
What have you used to get over this drawback ? We are also looking at the new masking from OEM, can 
that be used in any way to overcome this ?

Any help is appreciated.

Thanks.


3 stars Regarding masking different sources   October 29, 2009 - 10am Central time zone
Reviewer: A reader 
I have posted a question above regarding masking different sources. Can you please reply ?


Followup   October 29, 2009 - 10am Central time zone:

I am not personally aware of anything like that. You'd have to keep a mapping yourself to do something like this.
3 stars Masking from multiple sources   October 29, 2009 - 2pm Central time zone
Reviewer: A reader 
What is the common way that people do this ? I am surprised that I am not getting much responses 
because I thought that this would be common for any BI application.


Followup   October 29, 2009 - 3pm Central time zone:

It is common for testing scenarios - take this production data and mask it so we can test the set of functionality.

But for BI - the rule has always seemed to be that either

a) you are allowed to see the data
b) you are not allowed to see the data


If there was sensitive information in the base data, it would just not be exposed to me. For testing a production application - that isn't feasible (the application relies on seeing that data) but for a BI application that certainly is feasible - you are either allowed to see or NOT allowed to see it.

So, I haven't heard this request myself before, it goes against what I know.
3 stars Other ideas for SSN   October 30, 2009 - 3pm Central time zone
Reviewer: Jack from Little Rock, AR USA
A reader in the comment 3 prior said "For a lot of the algorithms, they are not repeatable in the 
sense, if I run it once on 012345678 it will give 555667777 but another time in another source it 
will give 999883333."

What algorithms?  For "a lot of the algorithms", they ARE repeatable (aka deterministic).  Here is 
one example, ssn_masked = (ssn + 1).

What is the point of masking the SSN?

If it is to render the SSN useless except as a join key (indicator of whether or not "the people 
are different"), then I would suggest a one-way key-hashed Message Authentication Code (MAC).  
Oracle provides dbms_crypto.mac.  Here is an example:


SQL> col ssn for a40
SQL>
SQL> select id,
  2         dbms_crypto.mac(
  3                        utl_raw.cast_to_raw(ssn),
  4                        2, -- hmac_sh1
  5                        utl_raw.cast_to_raw('a 32 byte secret key that only you know')
  6                        ) ssn
  7  from (
  8       select 1 id, '111223333' ssn from dual
  9       union all
 10       select 2 id, '444556666' ssn from dual
 11       union all
 12       select 3 id, '444556666' ssn from dual
 13       union all
 14       select 4 id, '111223333' ssn from dual
 15       union all
 16       select 5 id, '777889999' ssn from dual
 17       union all
 18       select 6 id, '111223333' ssn from dual
 19       )
 20  ;

              ID SSN
---------------- ----------------------------------------
               1 79022167C1ABEB8E56504517F38AC9B22AAAEA65
               2 14A1B87278AE44FC5087245E70090751DFEC8F53
               3 14A1B87278AE44FC5087245E70090751DFEC8F53
               4 79022167C1ABEB8E56504517F38AC9B22AAAEA65
               5 716B2D9C3420E6AC9D9E3B90B712C9356F0AD29B
               6 79022167C1ABEB8E56504517F38AC9B22AAAEA65

6 rows selected.


Now someone may use SSN to know if "the people are different," but not for anything else (e.g. 
identity theft).  Note, this hash is not reversible even by you (it is a one-way hash, not 
encryption).

Of course this changes the length of SSN from 9 to 20 bytes (raw).  It is not a simple "mask" as 
discussed earlier in this thread.  A simple mask is not going to prevent improper use of SSN.

You might also consider:

b) Maintain your own mapping to a surrogate value.  This will require more work to load a row to 
your data warehouse because you have to look up the surrogate value (or generate a new one if you 
haven't seen the SSN before).  This approach does allow you to have a 9 byte value for the 
surrogate.

c) Don't show SSN (better).

d) Don't put SSN in your data warehouse (best).


- Jack


Followup   November 9, 2009 - 11am Central time zone:

problem is, the applications are expecting a SSN. You cannot mask data in general by changing the type, that will break the application. The goal of data masking usually is to permit QA and Development staff to play with the application without exposing to them 'sensitive' data.

I agree with (d) and (c) - for their warehouse, they should not be exposing this information to anyone that is not allowed to see it - period. It either should not be there OR should be hidden (via dbms_rls) from the view of those that are not allowed to see it.

4 stars ďo not mask data in BI/DWH if possible   November 5, 2009 - 11am Central time zone
Reviewer: Jakub Illner from Prague
My experience is that it is extremely difficult and costly to mask data in BI/DWH applications and 
that it is almost always better to use unmasked data with correctly enforced access privileges and 
auditing than to try to mask the data.

Why: in BI/DWH you take data from source systems, apply some transformations, possibly use some 
staging tables and then you load data to the target table. After that you might populate aggregated 
tables or OLAP cubes. Then you publish data to the analytical environment.

So the data you need to test are not in one place but in several layers. It is not possible to mask 
just one layer but you have to mask all of them. But the data on the input (source data) might be 
in different format than the target data, so how could you mask them? And you cannot mask just the 
input data since the input is matched with the content stored in the system. And for some types of 
data extraction (e.g. Streams) you cannot mask data at all.

Additionaly a transformation or check which works on non masked data will not work on masked data. 
For example we have a rule that personal number (which is unique per citizen) must be divisible by 
11. Or there are rules how IBAN numbers are checksumed. Or the prefix of a phone number is used to 
identify country and operator etc. etc. All these rules will be broken if you mask the data, unless 
you use it in a very sophisticated way which preserves the rules.

And then we have data cleansing activities, which are quite difficult by itself. We try to 
standardize names, compare them with lists of known names and produce clean data. And do the same 
with addresses. And maybe car license plates etc. This cannot be done with masked data. And suppose 
you have a rule that beginning balance + sum of transactions = ending balance, how could you mask 
account balances and still be able to apply this check?

And, of course, you have to consider how to actually create a test/dev environment. Since the 
environment should be as close to production as possible (for good testing), most often the test 
environment is created as a clone of the production database, maybe with some subsetting applied 
(e.g. historical partitions are dropped).

So my recommendation is not to mask the data but spend the effort on having good access rights and 
auditing in place. And if you really need to hide some sensitive data like credit card numbers, 
then do it very selectively, applying column specific data masking rules. And be aware of the 
impacts.

And a final note - it is much better if the data model normalizes the sensitive information into 
one place (e.g. customer dimension, card dimension). If you use SSN or card numbers as keys which 
are all over the place, then to mask this information will be a nightmare. It is better to use 
surrogate keys.

--
Jakub


Followup   November 11, 2009 - 12pm Central time zone:

My experience is that it is extremely difficult and costly to mask data in
BI/DWH applications and that it is almost always better to use unmasked data
with correctly enforced access privileges and auditing than to try to mask the
data.



agreed 100%

Write a Review
 


All information and materials provided here are provided "as-is"; Oracle disclaims all express and implied warranties, including, the implied warranties of merchantability or fitness for a particular use. Oracle shall not be liable for any damages, including, direct, indirect, incidental, special or consequential damages for loss of profits, revenue, data or data use, incurred by you or any third party in connection with the use of this information or these materials.

About Oracle | Legal Notices and Terms of Use | Privacy Statement