One of the most effective tools for managing data quality and data governance, as well as giving business users confidence in the data warehouse results, is the audit dimension. We often attach an audit dimension to every fact table so that business users can choose to illuminate the provenance and confidence in their queries and reports. Simply put, the audit dimension elevates metadata to the status of ordinary data and makes this metadata available at the top level of any BI tool user interface.
The secret to building a successful audit dimension is to keep it simple and not get too idealistic in the beginning. A simple audit dimension should contain environment variables and data quality indicators, such as the following:
The audit dimension, like all dimensions, provides the context for a particular fact row. Thus when a fact row is created, the environment variables are fetched from a small table containing the version numbers in effect for specific ranges of time. The data quality indicators are fetched from the error event fact table that records data quality errors encountered along the ETL pipeline. We have written extensively about the error event fact table and the audit dimension. See especially the white paper An Architecture for Data Quality on our website. This short Design Tip is really just a reminder for you to build your audit dimension if you have been putting it off!
The environment variables in the above figure are version numbers that change only occasionally. The ETL master version number is a single identifier, similar to a software version number, that refers to the complete ETL configuration in use when the particular fact row was created. The currency conversion version is another version number that identifies a specific set of foreign currency conversion business rules in effect when the fact table row was created. The allocation version is a number that identifies a set of business rules for allocating costs when calculating profitability. All of these environment variables are just examples to stimulate your thinking. But again, keep it simple.
The data quality indicators are flags that show whether some particular condition was encountered for the specific fact row. If the fact row contained missing or corrupt data (perhaps replaced by null) then the missing data flag would be set to true. If missing or corrupt data was filled in with an estimator, then the data supplied flag would be true. If the fact row contained anomalously high or low values, then the unlikely value flag would be true. Note that this simple audit dimension does not provide a precise description of the data quality problem, rather it only provides a warning that the business user should tread cautiously. Obviously, if you can easily implement more specific diagnostic warnings, then do so. But keep it simple. Don’t try to win the elegance award.
The white paper mentioned above does a deep dive into more sophisticated versions of the audit dimension, but I have been concerned that the really advanced audit dimension designs are daunting. Hence this Design Tip.
Finally, if you build an audit dimension, show it to your business users. Here’s a before and after portion of a simple tracking report using an out of bounds indicator with values “Abnormal” and “OK” that provides a useful warning that a large percentage of the Axon West data contains unlikely values. The instrumented report is created just by dragging the out of bounds indicator into the query. Business users are surprisingly grateful for this kind of information, since not only are they curious as to why the data has been flagged, but they appreciate not making business decisions based on too little information.