SQL Server Forum / Programming / SQL / July 2008
To NULL or not to NULL
|
|
Thread rating:  |
Elkestra - 29 Jul 2008 13:38 GMT Purely theoretical while I'm finalising the design currently, so DDL isn't directly relevant unless requested.
I'm currently writing a recipe database for an online game. Each recipe is assigned a level within its appropriate skill, at which it becomes available. When data is entered, initially some recipes will be unassigned level-wise.
My first though was to have the level column (bigint) nullable, to indicate that the level wasn't yet known, but having read all the anathema for NULL values, would I be better making it NOT NULL, and then have it default to 0 - whilst have a bit-column flagging its 'level unknown' status, since 0 can also mean 'always available'.
Related to this, how would I deal with the recipe's skill-type if currently unassigned. Again, first thought would be NULL, but would having an '[Unassigned]' skill-type in the skills table be a better method?
Yours,
Elkestra (Ann-Marie)
Rick Sawtell - 29 Jul 2008 14:03 GMT > Purely theoretical while I'm finalising the design currently, so DDL isn't > directly relevant unless requested. [quoted text clipped - 18 lines] > Elkestra > (Ann-Marie) In my design world, I would use a lookup table with a FK constraint as follows.
CREATE TABLE dbo.SkillLevel ( SkillLevelID int IDENTITY (1,1) NOT NULL PRIMARY KEY CLUSTERED, SkillLevelDescription varchar(50) NOT NULL )
INSERT dbo.SkillLevel (SkillLevelDescription) VALUES ('Unknown') INSERT dbo.SkillLevel (SkillLevelDescription) VALUES ('1st Level') INSERT dbo.SkillLevel (SkillLevelDescription) VALUES ('2nd Level')
CREATE TABLE dbo.Recipe ( RecipeID int IDENTITY (1,1) NOT NULL PRIMARY KEY CLUSTERED, RecipeName varchar(50) NOT NULL, RecipeDescription varchar(50) NOT NULL, SkillLevelID int NOT NULL,
CONSTRAINT FK_Recipe_SkillLevel FOREIGN KEY (SkillLeveLID) REFERENCES dbo.SkillLevel(SkillLevelID) )
CREATE TABLE dbo.Components ( ComponentID int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED, ComponentName varchar(50) NOT NULL, )
CREATE TABLE dbo.RecipeComponents ( RecipeID INT NOT NULL, ComponentID INT NOT NULL,
CONSTRAINT FK_RecipeComponents_Recipe FOREIGN KEY (RecipeID) REFERENCES dbo.Recipe(RecipeID), CONSTRAINT FK_RecipeComponents_Components FOREIGN KEY (ComponentID) REFERENCES dbo.Components(ComponentID) )
Rick Sawtell
Roy Harvey (SQL Server MVP) - 29 Jul 2008 14:15 GMT For all the discussion about NULLs it really isn't that big a deal. It actually can work quite well for missing information, which is what you appear to have. Assigning a value that represents the same thing is another perfectly valid answer. I dislike the idea described that 0 actually have two very different meanings depending on the flag, so I would be looking for some other approach.
If the recipe skill type means a recipe is only associated with one skill then it is really the same question. If, on the other hand, a recipe can be associated with N skills I would suggest a "junction table", a table with two columns, the PK of Recipes and the PK of Skills, which together make up the PK of the RecipeSkill table. In that case instead of NULL or some special code you simple don't have any row for a specific recipe.
I have to mention that the idea of using a bigint to represent what sounds like a skill level, well that seems a bit extreme, but of course I don't understand the requirements.
Roy Harvey Beacon Falls, CT
>Purely theoretical while I'm finalising the design currently, so DDL isn't >directly relevant unless requested. [quoted text clipped - 18 lines] >Elkestra >(Ann-Marie) Razvan Socol - 29 Jul 2008 14:38 GMT Hello, Elkestra
If the level is unknown, I would use NULL in the Level column, so it will not appear in any level. If you want to say 'always available', I would use 0 in the Level column. Therefore, both values are acceptable, but with different meanings. When the data entry is complete, you may want to check for null values and assign them to a certain level or to 0 (always available). And maybe change the definition of the column to disallow NULLs thereafter. Anyway, I would not use a "LevelUnknown" bit column.
 Signature Razvan Socol SQL Server MVP
--CELKO-- - 29 Jul 2008 21:51 GMT >> Each recipe is assigned a level within its appropriate skill, at which it becomes available. When data is entered, initially some recipes will be unassigned level-wise. <<
>> My first though was to have the level column (bigint) nullable, to indicate that the level wasn't yet known, but having read all the anathema for NULL values, would I be better making it NOT NULL, and then have it default to 0 - whilst have a bit-column flagging its 'level unknown' status, since 0 can also mean 'always available'. << I assume that the skill_level has a range, like 0 to 10 or 100. Why are you using BIGINT??? That is more than the number of atoms in the Universe! You are inviting garbage data. And, yes, zero is the right way to go, not a NULL.
>> Related to this, how would I deal with the recipe's skill_type if currently unassigned. Again, first thought would be NULL, but would having an '[Unassigned]' skill_type in the skills table be a better method? << Yes it is, but why do you have a recipe without a skill_type -- BBQ, baking, sauces, deserts, vegetables, fish, etc. ought to be known at the start. Maybe I don't understand the problem.
Get a copy of my book SQL PROGRAMMING STYLE and read chapter 5 on the design of encoding schemes.
Hugo Kornelis - 30 Jul 2008 23:10 GMT >My first though was to have the level column (bigint) nullable, to indicate >that the level wasn't yet known, but having read all the anathema for NULL [quoted text clipped - 5 lines] >unassigned. Again, first thought would be NULL, but would having an >'[Unassigned]' skill-type in the skills table be a better method? Hi Elkestra / Ann-Marie,
I have to disagree with Rick Sawtell and Joe Celko, and agree with Razvan Socol. Forget what you have read about NULLs being bad, or hard, or whatever. They are not.
What *IS* hard, is dealing with missing information. You have to find some way to represent that in the table; then yoou have to make sure that all queries properly understand and deal with that representation (for instance, a query for the average level should not take into account the recipes with 0 if you use 0 to represent unknown difficulty, otherwise your result is skewed).
In my experience, NULLs may not be the best solution for the missing information problem, but all other solutions are far worse, making NULLs the answer of my choice to this problem. Unless, of course, I can tackle it at the root and ensure that no information will be missing - but that is in many cases way beyond the control of the lowly DBA or database developer.
Most of the problems people experience with NULLs fall in either of these two categories:
1) They are actually problems with missing information. Using some magic value to represent the missing information instead of NULL will not solve the problem.
2) They are caused by improper design (and esp. improper use of NULLs in the database design), or by insufficient understanding of what NULLs are and how they affect query results.
If you use NULLs to represent the fact that some specific bit of information (like, for instance, the level of a recipe) is missing in your database, you should just be fine - as long as you make sure that you know how NULLs affect query results.
<shameless self-plug>
Recommended reading: http://sqlblog.com/blogs/hugo_kornelis/archive/2007/07/06/null-ndash-the-databas e-rsquo-s-black-hole.aspx http://sqlblog.com/blogs/hugo_kornelis/archive/2007/07/17/the-logic-of-three-val ued-logic.aspx http://sqlblog.com/blogs/hugo_kornelis/archive/2007/09/22/dr-unknown-or-how-i-le arned-to-stop-worrying-and-love-the-null.aspx http://sqlblog.com/blogs/hugo_kornelis/archive/2007/09/30/what-if-null-if-null-i s-null-null-null-is-null.aspx
</shameless self-plug>
 Signature Hugo Kornelis, SQL Server MVP My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
--CELKO-- - 30 Jul 2008 23:19 GMT >> 1) They are actually problems with missing information. Using some magic value to represent the missing information instead of NULL will not solve the problem. <<
I use these heuristics:
1) Avoid them if you can. Use a special value and put it into a DEFAULT clause, if the domain of your column has it.
2) If you have a NULL, make sure it has one and only one meaning in that domain. Example: The ICD codes for disease use 000.000 for "Undiagnosed, you just got to the hospita" and 999.999 for "Diagnosed, and we don't know what it is" for two very different kinds of missing data.
3) Make sure the propagation property of NULLs makes sense in your data model. Otherwise, use COALESCE() to replace it with another value.
4) Remember that you must use "CAST (NULL AS <datatype>)" in places where the compiler needs to know the datatype of a column or variable.
Hugo Kornelis - 31 Jul 2008 15:03 GMT >>> 1) They are actually problems with missing information. Using some magic >value to represent the missing information instead of NULL will not [quoted text clipped - 5 lines] >DEFAULT >clause, if the domain of your column has it. Hi Joe,
Never use magic values. Didn't they teach you anything in database 101? Or were you that guy that was sleeping through the lessons in the back of the class?
Next week, business requirements may change and what you thought would never be a "real" value suddenly does become real. Ouch!
Or you forget to exclude the magical value from an aggregate query or from a range search and get your results all mixed up. More ouch!
NULL will, by definition, never become a real value in the business, since it's defined as the absence of any data value. (Check the SQL standards; I'm sure you have a copy somewhere).
>2) If you have a NULL, make sure it has one and only one meaning in >that >domain. NULLL is defined as "(a marker indicating) the absence of any data value". What other definition can there be? What other meaning could it possibly have?
> Example: The ICD codes for disease use 000.000 for >"Undiagnosed, you just got to the hospita" and 999.999 for "Diagnosed, >and we don't know what it is" >for two very different kinds of missing data. Bad example. First, this is ICD-9. ICD-10 was already completed in 1992 and most of the world has quickly adapted it. And ICD-11 is expected to be complete in a few years.
But what's worse, is that this is simply not true. The codes you mention are not defined; instead, there is a code 799.9: "Other unknown and unspecified cause". It took me no more that five minutes of Googling to find this site:
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/ICD9-CM/2007/
Download Dtab08.zip and review the full list of codes. Enjoy!
But besides this beinig a bad example, there is something way more fundamentally wrong with your argument. A disease is either diagnosed or not. If it's not diagnosed, then there is no diagnosis. Period.
There may be various reasons for the absence of a diagnosis. Like the two you mention, or "we have a diagnosis but lacked the time to enter the information into the computer", or "some new disease that is not yet defined in ICD-9, so please see the paper trail". Or others. And that reason might be relevant for the application, or it might not be. If it's not, then NULL for the diagnosis is all you need. And if the reason for a missing diagnosis *is* important, than you should define that as a new column. Because in databases, we never combine multiple attributes in a single column - that would be a violation of first normal form. Again, database 101.
>3) Make sure the propagation property of NULLs makes sense in your >data >model. Otherwise, use COALESCE() to replace it with another value. That's just a different way to say what I already said: make sure you know how NULLs affect query results.
>4) Remember that you must use "CAST (NULL AS <datatype>)" in places >where the compiler needs to know the datatype of a column or variable. That is only required if you use a hard-coded NULL in a situation where the datatype can not be inferred from something else in the query; a situation I have never seen outside of weird, carefully construed classroom examples.
Real queries deal with data stored in columns in tables or (sometimes) in variables. Both columns and tables have a datatype, so the DBMS knows the datatype whether the value is NULL or not.
 Signature Hugo Kornelis, SQL Server MVP My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
--CELKO-- - 31 Jul 2008 17:29 GMT >> NULL is defined as "(a marker indicating) the absence of any data value". What other definition can there be? What other meaning could it possibly have? << Using your logic, zero is a number, so it must have the same meaning in every domain where it is used? No, of course not. Zero can mean "my gas tank is empty" or "this water is frozen" depending on the domain. In the case of NULL it might mean "we do not have an end date for a duration yet", "This customer does not have a FAX machine", etc.
If you look around, you will find that a lot of encoding schemes have a "N/A", "miscellaneous", or other special values for different situations. The most common example is the NaN values in IEEE floating point numbers for us.
Hugo Kornelis - 31 Jul 2008 20:42 GMT >>> NULL is defined as "(a marker indicating) the absence of any data value". What other definition can there be? What other meaning could it possibly have? << > >Using your logic, zero is a number, so it must have the same meaning >in every domain where it is used? Yes. Zero always has the meaning of an amount that is exactly one less than one (or however mathematicians would define it). Zero should never have the meaning of sevenhundred and fourty three, or Orange.
> No, of course not. Zero can mean >"my gas tank is empty" Zero gallons.
Zero means one less than one, and gallons is a unit of measurement for fluid amounts.
> or "this water is frozen" Zero degrees centigrade.
Zero means one less than one, and degrees centigrade is a unit of measurement for temperatures.
Of course, these two zeroes will be interpreted differently by the end user, but there should not be ANY difference in how the database treats either of them. Take any value x, add either zero gallons or zero degrees centigrade to it, and the result should still be equal to x (even though the addition of extra gallons might make sense if x is also a fluid amount and the addition of degrees centigrade might not - with the current support for domains in SQL Server, there's no way anyone can expect SQL Server to know which one is valid and which one isn't; to SQL Server zero is just the number zero; nothing more, nothing less.
> depending on the >domain. In the case of NULL it might mean "we do not have an end date >for a duration yet", "This customer does not have a FAX machine", >etc. No. It just means that there is no end date known to the database, or that there is no fax number known to the database.
If the end user happens to know that there will always be a fax number in the database unless the customer doesn't have one - good for him. But that does not mean that the MEANING "no fax machine" is somehow encoded in the NULL, just as the MEANING "degrees centigrade" is not encoded in the value zero.
>If you look around, you will find that a lot of encoding schemes have >a "N/A", "miscellaneous", or other special values for different >situations. The most common example is the NaN values in IEEE >floating point numbers for us. 1) Many encoding schemes are designed for use in flat file or XML data interchanges. They should not be confused with relational principles. What makes sense in an information interface should not automatically be copied into the database. That's what staging tables are for.
2) Ain't it funny that of all the examples you give here, all just state the absence of a real value without giving any reason for WHY the value is absent?
(Though I'll immediately grant that there are lots of encodings where magic numbers ARE indeed used to specify various reasons why a regular value is not on file - like I said, makes sense in XML or flat file, but not in a properly normalized database).
 Signature Hugo Kornelis, SQL Server MVP My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
Aaron Bertrand [SQL Server MVP] - 31 Jul 2008 15:15 GMT > 1) Avoid them if you can. Use a special value and put it into a > DEFAULT > clause, if the domain of your column has it. This sounds a lot like the "magic" values you always harp against, such as the reason 50 and 255 are such popular choices, and why a punch card has 80 columns. I used to think like this too, but I have since learned that nothing really good comes from magic numbers made to look like real data just because NULL raises panic for some reason.
> 2) If you have a NULL, make sure it has one and only one meaning in > that > domain. Example: The ICD codes for disease use 000.000 for > "Undiagnosed, you just got to the hospita" and 999.999 for "Diagnosed, > and we don't know what it is" > for two very different kinds of missing data. Neither of those are really candidates for NULL, since there is SOME data about it, and NULL is supposed to represent the absolute absence of data. If this data you cite is important (obviously it is), then it should be part of the model, period. Whether that means the ICD_Code column has more candidate values (probably all 1 million possible codes are in a check constraint, right?), or the reason for the NULL value is explained in another column, I am not sure...
A
|
|
|