Thanks for the question, David.
Asked: August 28, 2012 - 10:42 am UTC
Last updated: October 11, 2012 - 9:28 am UTC
Version: 10.2.0.2.0
Viewed 1000+ times
You Asked
Hi Tom,
I am reviewing a new table design, and a question has come to mind about when to use check constraints and when not to.
The new table is designed to hold data from an external vendor. Some of the columns hold values that are enumerated types (e.g. values 1-20 mapped to account types, or a status field with only two documented values). The developer in this case has made an explicit decision not to constrain these fields, putting a comment to that effect in the table definition file. One argument I have heard to justify this is that since we do not create the data through one of our own batches or screens, we cannot control its quality. The vendor has been known to violate their own published standards in the past (data integrity and keeping documentation up to date do not seem to be their strong suits). A check constraint will cause the insert to fail when there is bad data, and that makes capturing the data more difficult. All the data must be captured from the external vendor.
I am uneasy with this. It seems to me that if the data needs to be stored in a table, integrity checking should be performed, and any load failures should be logged and dealt with immediately. Perhaps an error table could be used for any bad records.
What is your view? Can you suggest an efficient approach for maintaining data integrity while still capturing all records from an external vendor who is potentially lax about keeping its data standards documentation up to date?
Thanks very much.
David.
and Tom said...
This is an interface table - you are using it to hold data from an external source, and then you will process this data.
Now, in this interface table, you have an understanding of what the data should look like and what the rules are. For example - column X contains a number between 1 and 20.
Now, if someone stuffs 42 in there - what happens downstream? What will that bad value do? Might it corrupt data, break an application, cause a failure in some apparently unrelated bit of code? Maybe - sure, it could.
Now, you have two choices - either
a) use an interface table consisting of nothing but VARCHAR2(4000) columns. Remember - numbers and dates are *data integrity constraints*. If all data must be captured, the use of anything other than VARCHAR2(4000) doesn't make sense (and even then, you might consider a CLOB). The fact is, if they cannot be trusted to give you a number between 1 and 20, they cannot be trusted to give you a number at all.
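A minimal sketch of what such a permissive staging table might look like (table and column names here are made up for illustration):

```sql
-- Hypothetical staging table: every column is a generous VARCHAR2
-- so that anything the vendor sends can be captured without error.
CREATE TABLE vendor_stage (
  account_type VARCHAR2(4000),  -- documented as 1-20, deliberately not enforced here
  status_code  VARCHAR2(4000),  -- documented as one of two values
  load_date    VARCHAR2(4000),  -- even "dates" arrive as text until validated
  raw_payload  CLOB             -- optional: keep the original record verbatim
);
```

Nothing can fail on the way in; all the validation is deferred to the step that moves rows out of this table.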
and then you have a program that processes the data from this interface table into a constrained table that embodies your idea of what it means to be 'good data', logging errors into an error table for post-processing (this can be accomplished with DML error logging: a single INSERT ... SELECT).
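That single-statement load with DML error logging might look roughly like this (DBMS_ERRLOG and the LOG ERRORS clause are standard Oracle features; the table and column names are illustrative):

```sql
-- One-time setup: create the error table for the target table.
-- By default it is named ERR$_ followed by the table name.
BEGIN
  DBMS_ERRLOG.CREATE_ERROR_LOG(dml_table_name => 'ACCOUNTS');
END;
/

-- Move data from the loose staging table into the constrained table.
-- Rows that fail a constraint or conversion are written to the error
-- table instead of failing the whole statement.
INSERT INTO accounts (account_type, status_code, load_date)
SELECT account_type,
       status_code,
       TO_DATE(load_date, 'YYYY-MM-DD')
FROM   vendor_stage
LOG ERRORS INTO err$_accounts ('nightly vendor load')
REJECT LIMIT UNLIMITED;
```

The tag string ('nightly vendor load') lets you distinguish which run produced which rejected rows when you post-process ERR$_ACCOUNTS.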
b) put constraints on the interface table and reject bad data at the source: do not load it - log it and post-process it (again, DML error logging would be useful here).
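For option (b), the constrained interface table itself might be sketched like this (again, the names and the specific rules are illustrative assumptions, not the vendor's actual spec):

```sql
-- Hypothetical constrained interface table: the datatypes and check
-- constraints encode the documented rules, so a bad row cannot get in.
CREATE TABLE vendor_interface (
  account_type NUMBER(2)
    CONSTRAINT chk_acct_type CHECK (account_type BETWEEN 1 AND 20),
  status_code  VARCHAR2(1)
    CONSTRAINT chk_status CHECK (status_code IN ('A', 'I')),
  load_date    DATE
);
```

Loading into this table with LOG ERRORS (or capturing the loader's rejected records) gives you both goals at once: clean data in the table, and a complete record of everything the vendor sent, including the rows that broke their own published rules.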
Is this answer out of date? If it is, please let us know via a Comment