Back from Oracle OpenWorld
I am back and have been taking a set of new questions on asktom. Last week was a busy one out in California and I'm finally getting caught up on emails and questions (100% on the former, still working on the latter)
Anyway, I saw an article and it made me laugh - and sort of cry at the same time. It has to do with the use of default values. A lot of developers/DBAs have a very certain fear (that is the best word I can think of to describe their attitude) of NULL - the 'unknown' value. So, instead of using NULL for an effective_end_date field (for records that we don't know the end date for, they don't have one) or using NULL for values they do not know the value of - they use some 'fake' value. This fake value is assumed to be a value that could never possibly be used.
But....
Things change over time.
I'm sure when the developers implemented this system - using XXXXXXX for a license plate value that was unknown seemed 'reasonable'. I mean - who would ever ask for a vanity plate with seven X's on them?
Talk to the guy with almost $20k in fines that aren't his to see if he might know someone that might want a vanity plate with XXXXXXX on it :)
Do not fear NULL.
Understand it, but don't fear it.
http://tkyte.blogspot.com/2006/01/something-about-nothing.html
http://tkyte.blogspot.com/2006/01/mull-about-null.html
Anyway, I saw an article and it made me laugh - and sort of cry at the same time. It has to do with the use of default values. A lot of developers/DBAs have a very certain fear (that is the best word I can think of to describe their attitude) of NULL - the 'unknown' value. So, instead of using NULL for an effective_end_date field (for records that we don't know the end date for, they don't have one) or using NULL for values they do not know the value of - they use some 'fake' value. This fake value is assumed to be a value that could never possibly be used.
But....
Things change over time.
I'm sure when the developers implemented this system - using XXXXXXX for a license plate value that was unknown seemed 'reasonable'. I mean - who would ever ask for a vanity plate with seven X's on them?
Talk to the guy with almost $20k in fines that aren't his to see if he might know someone that might want a vanity plate with XXXXXXX on it :)
Do not fear NULL.
Understand it, but don't fear it.
http://tkyte.blogspot.com/2006/01/something-about-nothing.html
http://tkyte.blogspot.com/2006/01/mull-about-null.html
31 Comments:
What a coincidence, I just had read the following article which says that NULL are like plague;
http://www.bennadel.com/blog/85-Why-NULL-Values-Should-Not-Be-Used-in-a-Database-Unless-Required.htm
I listened to your recorded keynote, and it was first time I heard you. It was very pleasant.
@Fahd
and it is quite simply because of blog posts like that - that I feel compelled to write blog posts like this.
Did you see ANY TECHNICAL reasoning in that blog entry? Any at all?
It boils down to:
(quote)
NULL values cause several problems.
(/quote)
followed by - not a list of several problems, but fears, fears that due to a lack of understanding - you'll do it wrong (but isn't that true of EVERYTHING?).
He states "they are not data values" - sure they are, they represent a data value that is quite simply UNKNOWN. To which I say "perfect, since there isn't any other way to represent 'i don't know the value'"
I had to laugh when I read
(quote)
look at the following query where we are trying to find the number of fields that do not have values:
(/quote)
His query counts entirely different things - as he seems to miss the point that "when a column contains a value - it is not null, it is not unknown, when it is null - it is unknown"
the only query to answer his question would be:
select count(*) from t where column is null;
his other queries
"where len(column) = 0", that doesn't find unknown data, that would find data whose length is zero.
"where t.name NOT LIKE '_%'", that doesn't find unknown data - that finds data that exists such that it is not at least one character long.
"WHERE t.name IS NULL OR t.name NOT LIKE '_%'", that finds unknown data PLUS all data that exists (eg: is not unknown) that is not at least one character long.
His comment:
(quote)
... We were looking for 5 as records 4, 5, 7, 8, and 10 do not have values in them....
(/quotes)
shows that he has the classic fear of nulls due to lack of understanding them. If someone asked me to "find the count of rows without a value - with an unknown value" - the answer would have been "there are 2". There is no other answer to that question.
His question however was "find the count of rows without a value or with a specific characteristic - that of being not at least one character long
(quote)
So, allowing NULL values makes you work extra hard to get the kind of data you are looking for.
(/quote)
I hate that attitude. First of all - SO WHAT, you are actually forced to specify what you are looking for - oh boo hoo, woe is me.
It does not make it harder, it does make you be specific (computers are funny that way) about what you want.
(quote)
From a related angle, allowing NULL values reduces your convictions about the data in your database. You can never quite be sure if a value exists or not. Does that make you feel safe and comfortable when programming?
(/quote)
I've never been "not quite sure" myself, but...
How about this twist, I'll rewrite that:
From a related angle, IF allowing NULL values reduces your convictions about the ability of your developers to correctly code an application is a state you are in, you have to ask yourself: Does that make you feel safe and comfortable when allowing them access to a keyboard to do what they call programming?
I do like what he started to close with:
(quote)
Furthermore, while running LEN() on a NULL value doesn't act as you might think it to, it also does NOT throw an error. This will make debugging your code even harder if you do not understand the difference between NULL values and data values.
(/quote)
I think I might have heard that somewhere before, hmmmmmmmmm
That's awesome dissection tom.
Just like Oracle seniors like you raised awareness about Oracle Performance Tuning and helped getting rid of ratio-based tuning, there is a dire need of NULL-Fear Debunking.
This comment has been removed by the author.
@Richard
display is the job of the client application.
display UNK - fine (as long as UNK isn't a value that can possibly appear in that field. Hmmm, wonder if Mr. Thomas J. Unk would like your application? Anothy Platt would not like it http://en.wikipedia.org/wiki/Unk ... There is a university that might have problems with it http://www.unk.edu/ - I loved their mascot - that is funny )
store UNK - totally *bad* idea. See the article referenced in the first post AND this list right above.
What you *display* does not dictate what you should *store*
storing UNK - bad
displaying UNK - less bad.
Why wouldn't the application just display UNK if that is what the application thought best???
Tom,
Of all the people I know, the one who hates nulls most is CJ Date, and while the views I hold are not as extreme as Chris' I must admit he has a point. I have no qualm with the 'currently unknown' meaning of nulls, I have too often seen the "no null" mantra degenerate into EAV (sigh); I fully agree with you that like with everything understanding is key, but I'm not sure I'd agree with too lenient a view of nulls. The fact that they can mean "currently unknown" or "totally irrelevant", among other things, doesn't help. The three-valued logic implied by nulls has at some point to translate into the two-valued logic of returning/processing a row or discarding it, and stumbling in the process is all too easy for the young and innocent. And I'm not sure I'm always comfortable explaining why max() ignores nulls while greatest() doesn't.
In any case, I have always regretted that "null" is the default when creating a table with Oracle. I'd far prefer people to decide that they want a column to be nullable than the reverse - it would increase the odds that they understand nulls.
The problem of course is that there should be a difference between "Unknown" and "Not Applicable".
Simple example: I don't have a middle name. however, that does not imply that it is unknown.
The same goes for the state I lice in; Not the US, ergo N/A, and not unknown.
Using NULL in the database and then using NVL in the code only hides, but doesn't solve the problem stated in your post.
I think Oracle could make a lot of people happy if you could alter the session, or add a hint to the query to say "NULL = NULL", making the code much easier to read, doing away with reserved "unknowns", and maintaining the use of your indexes too.
@Stéphane Faroult
Whether you like it or not (the implementation of NULL) it is part of the standard and it is implemented in a standard fashion. If you understand what it is, what it does (that is true of everything) you have nothing to fear from it. If you "don't get it", then you don't get it (no more than someone that doesn't understand how to generate thread safe code would be in a multi-threaded environment - for example)
So, the fact that Codd doesn't like it doesn't mean anything really in the discussion of nulls - they are what they are.
@Robert
but that isn't a criticism of Nulls - that is a data model issue.
If you don't have a middle name, then yes, I would say it is Unknown (you don't have one, how could it be known?). You might need another attribute "has_middle_name" - but it really isn't an issue of null's. People have a hard time with tri-valued logic, imagine yet another dimension.
Allowing NULL=NULL would be in direct and complete violation of the standard - and the way it is to be implemented - which if you ask me would lead to MANY OTHER ISSUES, more than exist today (today we can learn what NULL is and how it works, if you change it - deviate from the standard - then all bets are off - people get different answers to the same queries - bad idea).
I don't know what you mean by "maintaining the use of your indexes too" - what was it you were trying to convey there? It is not like nulls are NOT indexed (they are, see the something about nothing link above)
@Robert
and by the way
where decode( expression, null, 1 ) = 1
is a way to say "null = null" if you want. It explicitly changes the where clause but does "null = null"
and in fact
where decode( expression1, expression2, 1 ) = 1
does it in 'general', returns 1 if expression1 is the same as expression2 - even if they are null...
"instead of using NULL for an effective_end_date field (for records that we don't know the end date for, they don't have one) "
and thereby guaranteeing that finding the current row (the one without effective end date) with
WHERE EFFECTIVE_END_DATE IS NULL
will unleash a FTS...
That has always been the base of the fear of using NULL. It's not NULL itself: it's what it does to indexing.
Of course, the correct thing to do is to have a flag for "current row".
Assigning the meaning of "current row" to effective_end_date = NULL is the equivalent of assigning a meaning to a NULL value: a recipe for disaster.
I do recall once writing in oracle-l "NULL has no meaning, period." and receiving a private reply from Lex de Haan with a single word: "Amen!"
@Noons
and thereby guaranteeing that finding the current row (the one without effective end date) with
WHERE EFFECTIVE_END_DATE IS NULL
will unleash a FTS...
that is entirely, provably false.
Please read something about nothing above.
It was never true that "where c is null" will unleash a full table scan.
It is not currently true
and it will not be true tomorrow. It is, has been and will be FALSE.
ops$tkyte%ORA10GR2> create table t ( x int primary key, y date );
Table created.
ops$tkyte%ORA10GR2> insert into t select object_id, case when mod(rownum,100) <> 0 then created else null end from all_objects;
50198 rows created.
ops$tkyte%ORA10GR2> create index t_idx on t(y,0);
Index created.
ops$tkyte%ORA10GR2> exec dbms_stats.gather_table_stats( user, 'T' );
PL/SQL procedure successfully completed.
ops$tkyte%ORA10GR2> set autotrace traceonly explain
ops$tkyte%ORA10GR2> select * from t where y is null;
Execution Plan
----------------------------------------------------------
Plan hash value: 470836197
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 492 | 6396 | 8 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| T | 492 | 6396 | 8 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | T_IDX | 492 | | 3 (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("Y" IS NULL)
ops$tkyte%ORA10GR2> set autotrace off
@Tom
Thanks, I'll play with that a bit, as it does solve the problem of the reserved "unknown value".
The performance is still going to suffer though, as decode is not the most economic function. Also, I'm not clear whether this will impact the use of indexes on the used columns like NVL(value,default) does.
@Robert
two things - see the comment right above yours - if you use "where column is null" we can and will use an index.
if you want the decode to use an index, it would be:
create index t_idx on t(decode(x,null,1));
and then you can query
where decode(x,null,1) = 1
and it would not only use the index, but the index would ONLY be on rows with nulls in column x.
@Tom
When I'm comparing with NULL, your function based index will work, however, Most often I'm not.
Given the example:
where decode( table1.column, table2.column, 1 ) = 1
Even if the database accepted a function index on something like this, I would have to make a separate index for every join combination with a different table.
I don't think there is a simple solution. There is a reason that this discussion never dies.
No Tom, it's not provably false.
A single column index on a column with NULL allowed will cause the optimizer to FTS on a IS NULL, every single time. That is the problem.
Of course you can circumvent it as you described by creating a "FBI" using a concatenation to a constant.
But that is a concatenated index, not a single column index.
Prove to me that a single column non-concat index can indeed avoid a FTS on a IS NULL.
Let's not forget as well that for a long time the CBO itself would immediately take a FTS, concatenated index or not, if someone specified IS NULL anywhere in a predicate.
It's only since skip-scan has been available that the CBO has become smarter with that case. And even then there are reported bugs on its use. Don't get me started on what the RBO does! It's not smething that's been there since year dot.
Also note that the use of a concatenation to a constant in an index is something that is entirely not clearly documented.
I'd have expected Oracle after all these years to include a section on the performance guide advising of this possibility.
Indexing of NULL values is a very common performance problem that could entirely have been avoided if the correct information had been made widely available.
The first reference I can find to work around this problem using concatenated indexes dates from around 2004.
Which is rather late in the proceedings for something that has been a problem since Oracle V4, all the way back to the 80s.
The fact that nowadays there are workarounds doesn't dispel the past problems.
That's why there is such angst on the subject of NULL indexing.
The past. Not the present.
Takes a long time to herd cats. ;)
@noons
you wrote:
and thereby guaranteeing that finding the current row (the one without effective end date) with
WHERE EFFECTIVE_END_DATE IS NULL
will unleash a FTS...
what is not false about that? I gave an example whereby finding the current rows with where effective_end_date is null DID NOT, WILL NOT unleash the dreaded full table scan.
So - you presume, assume that every index will be a single column index on a nullable attribute - it does not have to be.
So what if it is a concatenated index (and you need not use 0, you can use any non-nullable attribute in the table, no magic needed)
Please tell me why I need to 'prove a single column non-concatnated index ....' - I don't need to because
a) I cannot, we know that is not true
b) but so WHAT??? so WHAT???
I have shown that "where c is null" does not need to lead to a full table scan and if it is relevant (eg: where c is null is selective)... it is what we would be using.
And please remember with the effective end date stuff - if you are not going after tons of records (eg: there is something MORE than where c is null) you almost certainly have something else to index.
to find the current record for an employee: where empno = :x and edate is null
index(empno,edate) - done.
to find the current records for everyone in a department
index( deptno, edate ) - done (note: deptno can be nullable - I don't care)
it is only when you simply "where edate is null" and nothing else (which if you ask me would return a fair amount of the data - probably a full scan even if indexed, but let us say an index would make sense) - then you use a concatenated index with edate on the leading edge
and so what - who said "had to be a single column index" - that is not necessary, that is not a requirement, that is not reasonable.
@noons
Let's not forget that the fact that entirely NULL entries are not stored in a B*Tree index can actually be a *good thing*
A certain Mr Tom Kyte demonstrates this very nicely in his Effective Oracle By Design book (P416 - yes I looked it up for completeness).
The way I see it, you can either work *with* the way things work in the DB, or you can try and fight against it...if you choose the fight against route...you'll have a much unhappier time ;)
John.
Tom,
I have no fear for NULL but sometimes I get worried a little...
drop table tester purge;
create table tester (field1 number, field2 number);
insert into tester values (1,null);
insert into tester values (1,1);
insert into tester values (1,2);
insert into tester values (1,3);
insert into tester values (1,4);
insert into tester values (1,5);
insert into tester values (1,6);
insert into tester values (1,7);
insert into tester values (1,8);
insert into tester values (1,9);
insert into tester values (1,10);
commit;
select field1
, case when field2 is null then 0 else field2 end field2a
, case field2 when null then 0 else field2 end field2b
from tester
order by 1,2
;
Table dropped.
Table created.
1 row created.
1 row created.
1 row created.
1 row created.
1 row created.
1 row created.
1 row created.
1 row created.
1 row created.
1 row created.
1 row created.
Commit complete.
FIELD1 FIELD2A FIELD2B
---------- ---------- ----------
1 0
1 1 1
1 2 2
1 3 3
1 4 4
1 5 5
1 6 6
1 7 7
1 8 8
1 9 9
1 10 10
11 rows selected.
@MrTT
do not be afraid.
Do become 'educated' on how a SEARCHED CASE statement versus a 'normal' case statement works.
there are two types of CASE
case 1) searched case
case WHEN (tri-valued boolean)
THEN result
WHEN (tri-valued boolean)
THEN result
case 2) "normal" case
case (expression)
WHEN (equal to this value)
THEN
WHEN (equal to this other value)
THEN
....
Your fear is no more founded in this case then the fear someone has upon seeing this:
ops$tkyte%ORA10GR2> select rtrim( 'abc', 'cba' ) from dual;
R
-
ops$tkyte%ORA10GR2> select rtrim( 'Aaaabbbbccccc', 'xyzabc' ) from dual;
R
-
A
ops$tkyte%ORA10GR2>
How many times has someone come to asktom and said "why does this bug exist in rtrim/ltrim?"
They assumed that rtrim/ltrim took the second argument as a string and trimmed that STRING from the end.
But they do not, they take the second argument as a SET OF CHARACTERS and remove any/all of those CHARACTERS.
Once you know how something is defined to work - you can use it safely - until then, you cannot.
But that is true of everything (as I keep saying)...
Many things in life do not work the way I think they should work - that is my problem however, the solution is to learn how they actually do work and then use them appropriately.
Replace "I" with you and "me" with "your" in the above paragraph and it still applies.
The problem with NULL is not that this special exists, but that second special value does not exists. There are (as was already pointed out) 2 special cases - NO_VALUE (like "it is well known that this very person has no middle name") and UKNOWN (we do not know if this middle name exists, and what the name is if exists (e.g OCR system was unable to recognize it). Yes, there is workaround with flag (or "index") variable, but it is absolutely unclear why one of this special cases is recognized by standard and second completely ignored. Moreover, as far as I know, it is not clearly specified which of above mentioned cases NULL is supposed to represent (should admit that I have not read more or less recent version of the standard). And exactly this ambiguity causes a lot of confusion. Sometimes (or often?) NULL are used for both causing painful problems.
Tom,
I was alerted to your discussion of nulls, and having read both your and your contributors submissions, feel I must comment.
I'm not at all happy with the way the discussion of NULS has devolved into a very geeky 'how to program round' the consequences of accepting them into the database.
Lets go back to fundamentals. First - do you accept that Oracle (the
database) is founded on and is (to a greater or lesser degree) a
relational database? If no then my following arguments do not apply.
And in that case, I would be very interested to know what data model Oracle is based on if not the relational.
If yes - then whether you like it or not, you have to accept that the
relational data model is founded on two valued logic. Accepting nulls into the database IMMEDIATELY introduces three valued logic, which crucially compromises the integrity of the database. Period. No amount of fancy programming can circumvent this basic fact.
And the consequences? The results from complex queries (and indeed not so complex queries) can never be categorically proved in every case to be valid. With two valued logic - you CAN prove the correctness of any query. ANY QUERY. Using mathematics and logic.
The discussions in this thread are a classic example of what happens when people try to resolve the conflicts between two and three valued logic without really understanding them. You cannot resolve them by fancy programming.
Debating the meaning of nulls in an RDBMS is obfuscating the point. Stephane Faroult was absolutely right in his comments. So is the position argued by Lex de Haan, and so is that of Chris Date.
The most worrying phrase in your reply to Stephane reads as follows:
"Whether you like it or not (the implementation of NULL) it is part of the standard and it is implemented in a standard fashion." Which standard? Certainly not the relational standard. Unless, of course, Oracle has re-written the theory of the relational model, which I rather doubt!
One could be forgiven for thinking that this whole discussion on what nulls mean, and how to program around them, was centered on a file retrieval system, and not a relational database. The difference is, of course, profound.
Oh yes - how do you avoid nulls? By proper relational data design.
Peter
Edinburgh
@Peter: You are absolutely right that the problems introduced by NULL's 3-value logic should be circumvented by proper relational data design.
However, both you and Tom seem to overlook a few key problems here.
1. Many people are trying to compare data from legacy systems that they didn't built, and don't get to rebuild.
2. The business doesn't care that NULL means unknown. to them it's an empty field with an empty meaning. As my clients tell me "If I meant unknown I would have entered unknown"
3. When comparing existing data in a data warehouse with the full extract from the source, NULL is a specific column&record might actually mean unknown. But it's exactly the same identical unknown as the previous time, and not another. NULL isn't usually NULL , but in this case it is.
Regrettably, at the moment a lot of fancy programming is needed to deal with this.
@Robbert,
You too are quite correct. The real world demands that we deal with nulls, which can often require ugly solutions.
I was really highlighting the fact that the discourse had launched into the question of nulls without first recognising that they should be avoided whenever possible. The bottom line is that they are fundamentally incompatible with a relational database - people need to understand this first and foremost. Proper data design can remove this potentially dangerous weakness in the database.
So I would say try and redesign the database correctly to avoid having to use nulls in the first place. If you simply cannot do this, then your last resort is to try and programme around the problem, but this is fraught with difficulties (logical and practical).
It is possible for people of Tom's calibre to do this (and there is no doubt that it is tricky, beginning with trying to answer the question of what any one null is supposed to mean when null, sensu stricto, is meaningless), but when left to less experienced users (as Stephane pointed out) there is no guarantee that the results will be the correct ones.
Just consider this scenario. A major accident takes place, involving loss of life. The cause is traced back to the data queried from a database system. The programmers have to demonstrate that their query data is not at fault. If the data were derived from a true, 2VL database - they stand every chance of avoiding any subsequent problems of litigation. If data derives from a 3VL database, they have a hell of a problem to prove the integrity of their result.
I just wish more senior management realised the crucial importance of the methodolgy behind their data management systems. Failing to do so can result in simply enormous costs, both human and financial.
Peter
yes, null for "effective end date" ... Oracle HR uses 31-DEC-4712, maybe you should tell them!
@Peter Robson
"Which standard?" - SQL standard.
And Oracle is not a "strict" relational database but SQL database. And SQL standard requires NULLs.
The pure relational model is unusable - it lacks a lot of the concepts mandatory for any real-world database (e.g. ordering). So it must be extended.
And the three-valued logic (and more generally, multi-valued logic) is a part of the mathematics and by no means preclude mathematical analysis of the model, queries and so on.
Moreover, even fuzzy-logic exists.
Now back to the original relational model. You may be unaware that original works of Codd are plagued with several not so minor mathematical errors (one of them I discovered and published near 30 years ago).
@all,
I'll just say one more thing
pure or not
academically perfect or not
pleasant to you or not
They (nulls) exist.
If you do not know how they work - you will say all kinds of silly things like that first referenced blogger did above - all of which were FUD (fear, uncertainty, doubt) factors. None of them were technical.
If you know how they work, you can do things better.
Period.
Nothing more, nothing less.
It is like everything 'technical'. It works the way it works - it does not have to work the way YOU think it should have worked or how YOU would have done it. It is done, it does what is does in the manner it does it.
And if you don't learn how it works, you cannot use it safely.
I purposely wrote those last two paragraphs very generically. "it" can be "anything" - "it" could be "nulls", "it" could be SELECT, outer joins, shared server, whatever...
@al0
And the three-valued logic (and more generally, multi-valued logic) is a part of the mathematics and by no means preclude mathematical analysis of the model, queries and so on.
The fact here is that NULLs are in violation of the mathematical relational model. A relation consisting of a NULL or NULLs is nonsensical, from a mathematical perspective. And the anomalies with NULLs arise (and they are more pervasive than any other 'feature' of SQL databases) precisely because we deviate from the mathematical paradigm.
They are a fact of SQL database life, yes, but that doesn't mean that should be blind to the fundamental corruption of the Relational Model that NULLs constitute.
The pure relational model is unusable - it lacks a lot of the concepts mandatory for any real-world database (e.g. ordering). So it must be extended.
Extension of the Relational Model is not the issue; the violation of the Relation Model is. Let's not confuse extension with violation.
I don't fear nulls. I fear developers that use default values in place of nulls. Such as, 31-DEC-9999 as a replacement for nulls. Talk about data skew. Now the date range for low_value and high_value in user_tab_cols is off the charts. Yes, you could create a histogram and hope the CBO decides to use it. But you don't need to. Take advantage of null and create a concatentated index like Tom showed above; and remove the unneccesary logic of NVL(some_date,TO_DATE('31-DEC-9999','DD-MON-YYYY')).
OFF TOPIC:
Tom, it seemed that the "asktom" plug-in for FF is not working anymore. any plans to update it?
thank you
andrew
@anonymous
see the asktom home page, link to updated plugin with new URL is there.
Look for 'Install the Firefox search plugin for asktom'
thank you Tom. working ( plugin) now.
Andrew
POST A COMMENT
<< Home