by Jay Johansen | Apr 1, 2008
A classic problem with many computer systems is their inability to distinguish between zero or blank and "not specified". A few months ago I was browsing a real estate site and there were several houses listed with prices of "$0". I'll go out on a limb here and guess that they were not really giving these houses away for free. The system just had no way to say "price unknown". In this case there's unlikely to be much confusion, but it's not hard to think of cases where this could be ambiguous. Like, how many children do you have? Zero is a plausible number. There's a big difference between "I don't know how many children the Smiths have because I lost count around 17" and "The Smiths have no children".
Even in cases where it's obvious looking at one record that 0 means "unknown", when we calculate aggregates this gets lost. Suppose on the real estate database we want to know the average price of a home sold in a certain town. If the sale prices are $100,000, $200,000, and unknown, would you say the average price is (100+200+0)/3=$100,000? Surely a better answer would be $150,000.
Zero is a real live number, and is entitled to the same respect as all the other numbers.
SQL solves this problem with the idea of "null". A field can be allowed to be null, which means "unknown" or "not specified". This can be distinguished from blank or zero. Brilliant idea.
And then the authors of SQL went nuts.
I'm sure you've heard the phrase, "That sounds great in theory, but it doesn't work in practice." SQL nulls are an excellent example of this. From an abstract, mathematical point of view, the rules are arguably correct. But they are totally impractical and unrealistic.
SQL has two key rules about null handling.
(There are also some rules about logical operations that I won't get into here because they're more of the same.)
By a rigorous mathematical definition, this makes sense: The sum of 42 plus an unknown amount is an unknown amount. Appending an unknown string to "X" gives an unknown result. If we ask, is 42 equal to some amount that is unknown, the result is unknown. How do we know whether it's equal or not?
But in practice, this results in all sorts of annoying and unnecessary complexity.
Right off the bat, how do we determine if a field is null? By the above rules, we can't write "if x=null". The result of this comparison is always null. So instead we have a special construct: we must write "if x is null". I've seen many, many SQL bugs that turned out to be a programmer accidentally writing "x=null" when he meant "x is null". But why do we need this special construct? If "x=null" had some useful meaning, I could understand saying that we must create a new construct to test x for null. But in fact "x=null" is never a worthwhile thing to write: the answer is always null. So why did the creators of SQL not simply make it work the way 99% of the programmers in the world would intuitively expect it to work? The only reason I can see is because they were committed to the dogma that anything compared to null must give a null result.
But okay, I can deal with that. It's only a few extra characters.
More important, consider the sort of operations we normally do and how we would likely want to handle nulls.
Suppose we have a table in our database for customers, with first and last name as separate fields. This is commonly done so that we can easily present the name as "first last" or as "last, first" or other formats (among other reasons, like searching).
We might well be tempted to get a list of customers with a query like "select firstname||' '||lastname from customers". If a customer has firstname of 'Fred' and lastname of 'Smith', this will give 'Fred Smith'. But what happens if the person who entered this customer knew his last name was 'Smith' but didn't know his first name, and so (correctly) entered null? You'd think we'd at least want the 'Smith' part to show up on the report. But it won't. The whole thing will be null, and thus display as "null" or blank. Yes, it is true that unknown plus 'Smith' gives a result that is unknown. But ... we know that at least part of it is 'Smith'. Why is this fact thrown away?
We can, of course, write a more complex expression to deal with this. Like "select case when firstname is null then lastname when lastname is null then firstname else firstname||' '||lastname end as fullname from customers". But ... why is this necessary? It's a lot to type, unclear to read, complex which means more likely to have errors, etc. When more fields are being concatenated together or more complex string operations are being performed, this quickly spirals out of control.
The case with arithmetic is less clear-cut but similar. You might naively think that this query would work: "select sale.price+salestax.amount from sale left join salestax using (receiptnumber)". We would logically use a left join because some sales do not have sales tax -- some mail order sales, sales to non-profits, etc. Maybe you see the problem because I've got you thinking about nulls, but would you see it if we were just talking about sales? If there is a sale with no sales tax and thus no sales tax record, then all the sales tax fields are null. And sale.price plus null is null. So while you might think that a $10 sale with no sales tax would make a total of $10, in SQL it makes a total of "unknown".
Again, we could get the correct answer by writing a more complex query. We could write, "select sale.price+coalesce(salestax.amount,0) from sale left join salestax using (receiptnumber)". But again, this is a pain.
Comparisons involving nulls give truly bizare results. Any comparison against null gives a result not of true or false, but of null. But when you have to take action based on a comparison, you have to either do something or not do something. There is no third alternative. In practice, if the final result of a logical expression is null, SQL treats this as false.
A very common thing to want to do is break up a report by some criteria. Here's an example that doesn't sound at all far out to me. Suppose we have a business that is based in, say, Illinois. We want to produce one report for all of our in-state customers and another report for our out-of-state customers. You might think "select * from customer where state='IL'" and "select * from customer where state!='IL'". That won't work. A customer whose state is null will not show up on either report. Some customers will be mysteriously dropped. Realistically, wouldn't you think that state not specified should show up on the "not Illinois" report? Yes, if you want to be pedantic, you could argue that if the state is unknown, it might be Illinois. But surely we don't want these customers to just be left out completely. We want them on one report or the other, and the non-Illinois report is the obvious one to put them on.
Yes, we could get this result by writing "select * from customer where state!='IL' or state is null". But that is not natural or intuitive. For a simple query like this one it might not seem to bad, but as the query gets more complex, it gets out of hand rapidly.
The authors of SQL apparently realized that their stated rules don't give useful results and broke them in their aggregate functions.
The sum() function totals a given field across all records meeting the condition. Except that if a record has null, this does not make the total null as the rules would require. They simply don't add it in.
Likewise avg() is supposed to calculate an average. If any of the fields are null, by the rules the average should be null. But it isn't. avg is actually very intelligent about this: It excludes the null value from the calculation. It doesn't treat it as zero: it excludes it. So the avg() of 1, 3, and null is 2: The average of 1 and 3, with the null ignored.
The way the avg() aggregate function works make sense. An operation involving null should not result in the entire result being thrown away. The nulls should simply be ignored. For most purposes this means that null should be treated like zero if used as a number of empty string if used as a character string. For a few purpose the handling would be a little more complex.
For comparisons, null should be different from any non-null value, including zero or blank. But null should equal null.
By a pedantic mathematical analysis, perhaps the above could be declared technically wrong. But it would be extremely practical for real-world problems.
Okay, not like I think the international standards committees are going to change the SQL standard because I said so. But that's my opinion, and I like to complain.
© 2008 by Jay Johansen
No comments yet.