Friday, April 22, 2011

SQL Server: Joining on NULLs

Today I learned you can't join tables on NULLs because SQL Server deems a NULL to be an ambiguous quantity so two NULLs are not the same NULL. You can force SQL Server to do the join by doing something like this
... colors c JOIN points p ON (c.color = p.color OR (c.color IS NULL AND p.color IS NULL))
This example assumes only one team will have a NULL color.
CREATE TABLE colors (
  team      varchar(16) NOT NULL,
  color     varchar(32) NULL,
)
CREATE TABLE points (
  color     varchar(32) NULL,
  points    int
)

INSERT INTO colors VALUES ('lobsters','red')
INSERT INTO colors VALUES ('swans','white')
INSERT INTO colors VALUES ('jellyfish',NULL)

INSERT INTO points VALUES ('red',100)
INSERT INTO points VALUES ('white',90)
INSERT INTO points VALUES (NULL,80)
SELECT * FROM colors c JOIN points p ON c.color = p.color
-- returns:
--lobsters         red                              red                              100
--swans            white                            white                            90
SELECT * FROM colors c JOIN points p ON (c.color = p.color OR (c.color IS NULL AND p.color IS NULL))
-- returns:
--lobsters         red                              red                              100
--swans            white                            white                            90
--jellyfish        NULL                             NULL                             80

1 comment:

Unknown said...

I also assumed NULL joins would work, expecting the only issue to be runaway record duplication in the case of NULL-to-NULL joins. As you've discovered, SQL won't join on NULL, or any ambiguous values.

You suggested:
... colors c JOIN points p ON (c.color = p.color OR (c.color IS NULL AND p.color IS NULL))

This will work, but the ORing of two ON conditions, and the explicit ANDing of IS NULL checks may put a significant load on server resources, especially when joining large tables. I would suggest using the IsNull function to alias NULLs to a reserved literal, such as the pipe character:

... colors c JOIN points p ON (IsNull (c.color, '|') = IsNull (p.color, '|'))

This will allow the server to optimize the NULL conversion with the native logic of the IsNull function. You should probably explicitly trap the NULL-to-NULL case as well, to prevent wasting resources generating massive unusable datasets.

(apologies for the cross-post in LinkedIn comments)