Design Tip #43 Dealing with Nulls in the Dimensional Model describes two cases where null values should be avoided in a dimensional model; in these situations, we recommend using default values rather than nulls. This Design Tip provides guidance for selecting meaningful, verbose defaults.
Handling Null Foreign Keys in Fact Tables
The first scenario where nulls should be avoided is when we encounter a null value as a foreign key for a fact table row during the ETL process. We must do something in this case because an actual null value in a foreign key field of a fact table will violate referential integrity; the DBMS may not allow this situation to happen. There are a number of reasons why we have no foreign key:
- There is a data quality issue because the key value provided by the source system is invalid or incorrect.
- The dimension itself is not applicable for the particular fact row.
- The foreign key value is missing from the source data. In some case, this missing data is another data quality issue. In other cases, the foreign key legitimately is not known because the event being tracked has not yet occurred as frequently happens with accumulating snapshot fact tables.
When dealing with null foreign keys, we suggest applying as much intelligence in the ETL process as possible to select a default dimension row that provides meaning to the business users. Do not simply set up a single default row and point all default scenarios to the same row. Consider each condition separately and provide as many default rows as needed to provide the most complete understanding of the data as possible. At a minimum consider the following default rows:
- Missing Value – The source system did not provide a value that would enable looking up the appropriate foreign key. This could indicate a missing data feed in the ETL process.
- Not Happened Yet – The missing foreign key is expected to be available at a later point in time.
- Bad Value – The source provided bad data or not enough data to determine the appropriate dimension row foreign key. This may be due to corrupted data at the source, or incomplete knowledge of the business rules for this source data for this dimension.
- Not Applicable – This dimension is not applicable to this fact row.
Every dimension needs a set of default rows to handle these cases. Usually the ETL team assigns specific values such as 0, -1, -2, and -3 to the keys that describe these alternatives. The choice of the specific key value generally makes no difference, but there is one weird special case. When the calendar date dimension is used as the basis for partitioning a large fact table (say, on the Activity Date of a set of transactions), care must be taken that the Activity Date always has a real value, not one of the exceptional values, since such an exceptional record will get partitioned off to Siberia in the oldest partition if it has a key value of 0!
Handling Null Attribute Values in Dimension Tables
Nulls should also be avoided when we can’t provide a value for a dimension attribute in a valid dimension row. There are a several reasons why the value of a dimension attribute may not be available:
- Missing Value – The attribute was missing from the source data.
- Not Happened Yet – The attribute is not yet available due to source system timing issues.
- Domain Violation – Either we have a data quality issue, or we don’t understand all the business rules surrounding the attribute. The data provided by the source system is invalid for the column type or outside the list of valid domain values.
- Not Applicable – The attribute is not valid for the dimension row in question.
Text attributes in dimension tables usually can contain the actual values that describe the null conditions. Try to keep in mind the effect on BI tools downstream that have to display your special null value description in a fixed format report. Avoid tricks we’ve seen, such as populating the default attributes with a space or meaningless string of symbols like @@@ as these only confuse the business users. Consider the default values for each dimension attribute carefully and provide as much meaning as possible to provide context to the business users.
Numeric attributes in dimension tables will need to have a set of special values. A value of zero often is the best choice because it is usually obvious to the users that it is artificial. Some numeric attributes will present you with a difficult choice if the business users combine these values in numeric computations. Any actual numeric value used to stand in for null (say, zero) will participate in the computation but give misleading results. An actual null value often will cause an error in a computation, which is annoying but at least does not produce a falsely confident result. Perhaps you can program your BI tool to display null numeric dimension attributes with “null” so that you can both report and compute on these attributes without worrying about distorted data.
Finally, these default value choices should be re-used to describe common null conditions across business processes and dimension tables in your dimensional data warehouse.