Data Masking
January 3, 2006 - 8am Central time zone
Reviewer: David from UK
A most excellent tool for data masking may be found at the following :
www.datamasker.com
This software provides a simple repeatable method of scrambling data.
Thanks Tom.
January 3, 2006 - 9am Central time zone
Reviewer: Gleisson Henrique from Washington, DC USA
I will look in the documentation for the row hiding in 9i that you mentioned to see how it works
and what it does. I found this link that might work for my purposes.
http://asktom.oracle.com/pls/ask/f?p=4950:8:12084722728803631128::NO::F4950_P8_DISPLAYID,F4950_P8_CR
ITERIA:6382864029082
Analytics really *rock* but I agree with Robert on that post, Analytics are hard to read. But the
way, I saw on some page on AskTom that a reader suggested that you should publish a book on
Analytics. That's actually a good idea. What do you say ?
I didn't want to advertise on the AskTom, but since you said you would post to see if anyone knows
any tool that might do that, I found one last night after I posted the question, and it's called
The Data Masker. I watched the tutorial and it looks like it works pretty well.
Followup January 3, 2006 - 9am Central time zone:
That would kill performance (using analytics to shift data), you want to obscure it - physically
scramble it - for testing purposes.
the row/column hiding is Fine Grained Access Control. And I don't think it applies to this problem.
And I don't mind "refering to software that solves a problem" on here at all.
That was quick ! ! ! ! ! !
January 3, 2006 - 9am Central time zone
Reviewer: Gleisson Henrique from Washington, DC USA
Yeah, I started to look into row hiding and is not really what I thought it was.

January 3, 2006 - 9am Central time zone
Reviewer: Rajiv from VA, US
Why not write a couple of pl/sql functions using dbms_random, to just scramble the data while
loading to development (either through sqlldr or select insert over dblink/whatever..) Thats pretty
much what the tool also would do behind the scenes?

January 3, 2006 - 9am Central time zone
Reviewer: Jon
> A most excellent tool for data masking may be found at
> the following :
> www.datamasker.com
> This software provides a simple repeatable method of
> scrambling data.
Do you know what the price is?
Does it install packages in Oracle and perform the data masking there?
Shift columns in PL/SQL table
January 3, 2006 - 12pm Central time zone
Reviewer: Dejan from Slovenia
Hi,
We have the same requirement of scrambling test data and we did the following:
- Bulk load "customer data" into PL/SQL table
- in PL/SQL table shift column "last_name" 1 up,
column first_name 1 down ..
- Update original customer table ..
Regards Dejan
DBMS_Obfuscation_Toolkit
January 4, 2006 - 5am Central time zone
Reviewer: Will from UK
Hi Tom,
I used the Oracle obfuscation package successfully before after reading your Expert One-on-One
book. select/insert/update operations were all carried out using stored procedures which made
implementation easy. I just wondered if you had a reason for not suggesting it this time?
Would it be possible to combine this with FGAC? If so could we then store the data in its
naturally readable form, but obfuscate it, maybe with an obfuscated view, and depending on the
querying user?
Best regards
Will
Followup January 4, 2006 - 9am Central time zone:
data distributions.... (it is not an easy problem)
data widths change....
This data masking for testing is not a "trivial problem" to solve.
Book on Analytics
January 4, 2006 - 10am Central time zone
Reviewer: Michael T. from Dallas, Tx USA
I know on a couple of threads you have been asked about, or mentioned thinking about writing a book
on analytics. I think a Cookbook style book, e.g. Perl Cookbook, on analytics would be terrific.
It would be chock full of examples and each example would be explained, so even if a reader
couldn't find an exact match they would most likely be able to use one the examples as a starting
point for their solution.
Book on analytics
January 4, 2006 - 11am Central time zone
Reviewer: Jonathan Gennick from Munising, Michigan, USA
A couple of people have mentioned a desire for a book on analytics. I hope I'm not too out of line
mentioning this, but Anthony Molinaro uses analytics extensively in his just-published SQL
Cookbook. I was at times worried that he leaned on them too heavily (I was his editor). Anyway, if
you want a source of examples, that book is a good possibility.
Are views an option?
January 4, 2006 - 12pm Central time zone
Reviewer: MarkB from Birmingham, England
Could you not mask the required columns behind a view?
Columns that are not to be shown can either by totally omitted or replaced with something like
'DETAILS NOT ALLOWED - CONTACT HELPDESK'
or masked out
'****'||SUBSTR(customer.address,LENGTH(customer.address)-3)
However, this still leaves the "base table" visible, so permissions would need to be changed
appropriately.
Followup January 4, 2006 - 1pm Central time zone:
data distribution problems.
indexes are not used problems..
updates, inserts - problems...
if the goal was to test - this would not permit a useful test.
Police data
January 5, 2006 - 8am Central time zone
Reviewer: Connor from Mandurah
We have a similar problem (we record police criminal records and want to populate our test systems
with realistic data).
You need to find a balance between "masking" and "realism". For example, we maintain a table which
contains all column names with a tag of: "do nothing", "wipe", "scramble" where
- wipe means null out the data (eg images etc)
- do nothing means leave the data untouched (typically dates, numeric columns etc)
- scramble means mask the data
We also have some smarts in there about constraints and index maintenance options which I won't
bore you with.
Our masking algorithm has the characteristics of:
- retain the first character (to keep some semblance of the data distribution)
- case is not retained (except for first character)
- length is retained
with the scrambling done with an old fashioned substitution cipher, with a different substitution
array for each of the possible leading characters of the source column.
Now that of course makes things a lot more hackable, (eg, "Smith" is a popular name so (eg) "Sxy@f"
in the masked data probably means "smith". But we're not trying to make our data *impossible* to
decode, we're simply trying to make sure our development/testing teams don't "stumble upon"
information they were not meant to see. Many of them have legitimate access to the unscrambled
production data anyway (that access is obviously more finely audited). Ultimately, if you don't
trust your developers, then you're dead in the water anyway - they could simply easter egg the code
to get whatever production data they needed.
Also, the fact that the cipher is 1-to-1 gives benefits that are probably worth the loss of
masking. For example, if a user has a problem when they so (say) an operation on: "Johnson", we
can let the test team know to attempt that same operation in our test environment on (say):
"Jr8w8dhq" to see if they can replicate the problem. The moment you go with some more "hard core"
scrambling, you will lose that facility.
Another plus for a simple cipher method is because its also a balance between the amount of
scrambling you do and the resources required to actually scramble that data. With substitution, we
can build a 2-dimensional array package variable which contains all the substitutions and simply
issue:
update table
set col1 = translate(col1, the_standard_ascii_table,
the_substition_table(ascii(col1))
so the scrambling is down with a single SQL rather than a mish-mash of PL/SQL and unload/reload
operations.
hth
Connor
Don't scramble - mix'n'match
February 14, 2006 - 5pm Central time zone
Reviewer: Isaac Blank from San Francisco, CA, USA
We once used a following method of obfuscating personal data: We would randomly swap column values
between different rows, so if you have John Brown, living on 123 Main St., Columbus, OH and Jane
Dow from 567 Broadway, Manahattan, NY , they become John Doe, 567 Broadway, Columbus, NY and Jane
Brown, 123 Main St., Manhattan, OH. That way data distribution for individual columns stays
intact. Composite indexes are, of course, affected, but it gave us a pretty good database to play
with

August 8, 2006 - 8am Central time zone
Reviewer: A reader
Looking for mix'n'match solution
September 26, 2006 - 3pm Central time zone
Reviewer: siddu from Los Angeles, CA USA
We are looking at a mix'n'match solution, like the one posted by Isaac Blank above. What is the
best way to achieve this? We tested using the cursor logic, and it is very slow. We have over 20
million Customer records. I appreciate any help on this.
Followup September 26, 2006 - 5pm Central time zone:
ok, it does seem that Isaac therefore told you what to do? what more do you need to get started?
this is not an algorithm I really want to play with "too much". scrambling data is "hard",
requires some thought about YOUR data and what makes sense.
so, what are you scrambling here? and why? and what does it look like?
Re: Looking for mix'n'match solution
September 27, 2006 - 12am Central time zone
Reviewer: Siddu from Los Angeles, CA USA
I'm looking for some logic or SQL to shuffle the Names and addresses across the rows with better
performance than using cursor. We want to scramble the Names, Addresses, SSN, DOB, Phone Numbers
and Email Addresses. They all are stored in different tables. We do not want any sensitive data to
be available outside the production systems. At the same time we want to have meaningful data
available for testing and other needs.
We want to shuffle the data randomly in Name and Address tables. We want to scramble the SSN, Phone
Number and Email addresses. Add few days to date fields.
Followup September 27, 2006 - 4am Central time zone:
it is going to be slow regardless of the approach here, think of the work involved for a table of
any size.
say the table has 5 columns you want to "scramble", we'd have to sort that table randomly 5 times
in order to scramble it up. Trying to do something in a bulk sql statement is going to be hard -
since you cannot predicably get functions to be called over and over and over
so, say you wanted to "scramble" the owner and object_name column from all_objects, you might be
able to do something like this - but it will require hitting all_objects three times into temp and
then joining the results:
create table t
as
with main_data
as
(
select rownum r,
object_id,
subobject_name,
data_object_id,
object_type,
created
from all_objects
),
new_object_names
as
(select object_name,
row_number() over (order by dbms_random.random) r
from all_objects
),
new_owners
as
(select owner,
row_number() over (order by dbms_random.random) r
from all_objects
)
select object_id, subobject_name,
data_object_id, object_type,
created, object_name, owner
from main_data a, new_object_names b, new_owners c
where a.r = b.r and a.r = c.r
/

December 4, 2007 - 3am Central time zone
Reviewer: karthick from India
Data masking cannot be generic. It must be application specific. The tool that masks the data must
know the meaning of those data that it¿s masking.
When you say you want to give your production data which is masked for testing and you want your
application to be tested with that data. That data must be a valid data it should satisfy all the
rules that the application applies on the data.
The masker should know the nature of the information that it¿s handling.
It should identify all constrains on the data and the masked information should satisfy them. By
constraints I don¿t mean database constraints alone. Application level constraints. Even though
constraints should exist in the data base this world has the dirty habit of putting it in different
places.
What if there is a front end restriction that salary cannot be > 1000 for manager and your masked
data has salary 2000 for manager. How is this going to be handled?
Think about indexes what if there is a unique index in a column you should ensure that the masked
data is unique. What if you have a function based index on a column what way it will get affected
when you mask the data used by that index.
So when it¿s masking you are not masking just the data in your database but the data in the
application. Thought at application level is very important for this.
As tom pointed out some technical problem in this, we actually have bigger functional and logical
problems too
Masking different sources with same algorithm ?
October 28, 2009 - 10am Central time zone
Reviewer: A reader
Hi - We are right now designing a 11g data warehouse database for a BI application. This database
is going to have data loaded from different sources. And a lot of them will have the same
information for a person like SSN, Name, etc. We have a masking requirement but the trick here is
that for a person's SSN, it should be the same value across all sources for eg., if a person's SSN
is 012345678 then the resultant masked value should be 555667777 from all sources. For a lot of the
algorithms, they are not repeatable in the sense, if I run it once on 012345678 it will give
555667777 but another time in another source it will give 999883333. Then in the data warehouse
final database it might seem that these people are different. Has anyone come across this before ?
What have you used to get over this drawback ? We are also looking at the new masking from OEM, can
that be used in any way to overcome this ?
Any help is appreciated.
Thanks.
Regarding masking different sources
October 29, 2009 - 10am Central time zone
Reviewer: A reader
I have posted a question above regarding masking different sources. Can you please reply ?
Followup October 29, 2009 - 10am Central time zone:
I am not personally aware of anything like that. You'd have to keep a mapping yourself to do something like this.
Masking from multiple sources
October 29, 2009 - 2pm Central time zone
Reviewer: A reader
What is the common way that people do this ? I am surprised that I am not getting much responses
because I thought that this would be common for any BI application.
Followup October 29, 2009 - 3pm Central time zone:
It is common for testing scenarios - take this production data and mask it so we can test the set of functionality.
But for BI - the rule has always seemed to be that either
a) you are allowed to see the data
b) you are not allowed to see the data
If there was sensitive information in the base data, it would just not be exposed to me. For testing a production application - that isn't feasible (the application relies on seeing that data) but for a BI application that certainly is feasible - you are either allowed to see or NOT allowed to see it.
So, I haven't heard this request myself before, it goes against what I know.
Other ideas for SSN
October 30, 2009 - 3pm Central time zone
Reviewer: Jack from Little Rock, AR USA
A reader in the comment 3 prior said "For a lot of the algorithms, they are not repeatable in the
sense, if I run it once on 012345678 it will give 555667777 but another time in another source it
will give 999883333."
What algorithms? For "a lot of the algorithms", they ARE repeatable (aka deterministic). Here is
one example, ssn_masked = (ssn + 1).
What is the point of masking the SSN?
If it is to render the SSN useless except as a join key (indicator of whether or not "the people
are different"), then I would suggest a one-way key-hashed Message Authentication Code (MAC).
Oracle provides dbms_crypto.mac. Here is an example:
SQL> col ssn for a40
SQL>
SQL> select id,
2 dbms_crypto.mac(
3 utl_raw.cast_to_raw(ssn),
4 2, -- hmac_sh1
5 utl_raw.cast_to_raw('a 32 byte secret key that only you know')
6 ) ssn
7 from (
8 select 1 id, '111223333' ssn from dual
9 union all
10 select 2 id, '444556666' ssn from dual
11 union all
12 select 3 id, '444556666' ssn from dual
13 union all
14 select 4 id, '111223333' ssn from dual
15 union all
16 select 5 id, '777889999' ssn from dual
17 union all
18 select 6 id, '111223333' ssn from dual
19 )
20 ;
ID SSN
---------------- ----------------------------------------
1 79022167C1ABEB8E56504517F38AC9B22AAAEA65
2 14A1B87278AE44FC5087245E70090751DFEC8F53
3 14A1B87278AE44FC5087245E70090751DFEC8F53
4 79022167C1ABEB8E56504517F38AC9B22AAAEA65
5 716B2D9C3420E6AC9D9E3B90B712C9356F0AD29B
6 79022167C1ABEB8E56504517F38AC9B22AAAEA65
6 rows selected.
Now someone may use SSN to know if "the people are different," but not for anything else (e.g.
identity theft). Note, this hash is not reversible even by you (it is a one-way hash, not
encryption).
Of course this changes the length of SSN from 9 to 20 bytes (raw). It is not a simple "mask" as
discussed earlier in this thread. A simple mask is not going to prevent improper use of SSN.
You might also consider:
b) Maintain your own mapping to a surrogate value. This will require more work to load a row to
your data warehouse because you have to look up the surrogate value (or generate a new one if you
haven't seen the SSN before). This approach does allow you to have a 9 byte value for the
surrogate.
c) Don't show SSN (better).
d) Don't put SSN in your data warehouse (best).
- Jack
Followup November 9, 2009 - 11am Central time zone:
problem is, the applications are expecting a SSN. You cannot mask data in general by changing the type, that will break the application. The goal of data masking usually is to permit QA and Development staff to play with the application without exposing to them 'sensitive' data.
I agree with (d) and (c) - for their warehouse, they should not be exposing this information to anyone that is not allowed to see it - period. It either should not be there OR should be hidden (via dbms_rls) from the view of those that are not allowed to see it.
ďo not mask data in BI/DWH if possible
November 5, 2009 - 11am Central time zone
Reviewer: Jakub Illner from Prague
My experience is that it is extremely difficult and costly to mask data in BI/DWH applications and
that it is almost always better to use unmasked data with correctly enforced access privileges and
auditing than to try to mask the data.
Why: in BI/DWH you take data from source systems, apply some transformations, possibly use some
staging tables and then you load data to the target table. After that you might populate aggregated
tables or OLAP cubes. Then you publish data to the analytical environment.
So the data you need to test are not in one place but in several layers. It is not possible to mask
just one layer but you have to mask all of them. But the data on the input (source data) might be
in different format than the target data, so how could you mask them? And you cannot mask just the
input data since the input is matched with the content stored in the system. And for some types of
data extraction (e.g. Streams) you cannot mask data at all.
Additionaly a transformation or check which works on non masked data will not work on masked data.
For example we have a rule that personal number (which is unique per citizen) must be divisible by
11. Or there are rules how IBAN numbers are checksumed. Or the prefix of a phone number is used to
identify country and operator etc. etc. All these rules will be broken if you mask the data, unless
you use it in a very sophisticated way which preserves the rules.
And then we have data cleansing activities, which are quite difficult by itself. We try to
standardize names, compare them with lists of known names and produce clean data. And do the same
with addresses. And maybe car license plates etc. This cannot be done with masked data. And suppose
you have a rule that beginning balance + sum of transactions = ending balance, how could you mask
account balances and still be able to apply this check?
And, of course, you have to consider how to actually create a test/dev environment. Since the
environment should be as close to production as possible (for good testing), most often the test
environment is created as a clone of the production database, maybe with some subsetting applied
(e.g. historical partitions are dropped).
So my recommendation is not to mask the data but spend the effort on having good access rights and
auditing in place. And if you really need to hide some sensitive data like credit card numbers,
then do it very selectively, applying column specific data masking rules. And be aware of the
impacts.
And a final note - it is much better if the data model normalizes the sensitive information into
one place (e.g. customer dimension, card dimension). If you use SSN or card numbers as keys which
are all over the place, then to mask this information will be a nightmare. It is better to use
surrogate keys.
--
Jakub
Followup November 11, 2009 - 12pm Central time zone:
My experience is that it is extremely difficult and costly to mask data in
BI/DWH applications and that it is almost always better to use unmasked data
with correctly enforced access privileges and auditing than to try to mask the
data.
agreed 100%
|