One of the tasks of the ETL system’s customer dimension manager is to “assign a unique durable key to each customer.” By durable key, we mean a single key value that uniquely and reliably identifies a given customer over time. In most cases, this unique durable key is the natural business key from the operational systems, and all we have to do is copy it over as an attribute in the dimension table. However, there are cases where the natural key changes, and when it does, the dimension manager has to step in.
Some common causes for natural key changes include business reasons, duplicate entries, and integration of data from multiple sources. All of these require the creation and management of a unique durable key, also known as a super-natural key, during the ETL process.
A good example of a natural key change for business reasons comes from the credit card industry. Credit card account numbers are natural keys; they appear in transactions, and are mapped to surrogate keys in the dimension table. If a credit card is stolen, a new account number is issued. Without awareness of this change, the new account number would look like a totally new account and be entered as a new entity in the account dimension. The full history of the account would be lost because it now has two natural keys.
When this kind of business-driven change happens, the transaction system must generate a notification record telling the ETL process that a new account has been created to replace an old account. This could be as simple as a table with the old account number, the new account number, and the effective date. The ETL system must then create a new row with the new account number and a separate durable key column that ties the old and new accounts together. Figure 1 shows how this durable key would look in an account dimension.
Figure 1 – A durable key in an account dimension.
In Figure 1, the Account_Key is the surrogate key assigned by the ETL system to uniquely identify each row. The Account_ID is the ETL substitute for the natural key from the transaction system because you would not usually load a sensitive element such as a credit card account number directly into the data warehouse. The third key column, called the Durable_Account_ID, is the durable key assigned by the ETL system to tie all related rows together. Figure 1 shows four rows for the same account because there were two Type 2 changes to track, one state change and one last name change, in addition to the Account_ID change.
Another useful design pattern is to add the durable account key to the fact table in addition to the dimension’s surrogate key. This joins back to the current rows in the dimension to make it easier to report all of history by the current dimension attributes. See the article titled “Slowly Changing Dimensions Are Not Always as Easy as 1, 2, 3” for more information.
Dealing with duplicate entries in a dimension, or integrating disparate sources into a single dimension, involves more complex business logic that relies on durable keys. The end result is similar to Figure 1, but the integration process has to generate the list of related items rather than relying on the transaction system. Figure 2 shows the integration of products from multiple source systems. The MDM subsystem in the ETL process has identified these three products as the same and assigned them to the same durable key.
Figure 2 – Duplicate product entries.
In this case, you might need to mark the natural keys from the different sources so they don’t collide with each other. Try using a character data type for the source natural key, prepended by a source code. For example, if the products in Figure 2 came from SAP or the CRM system, the Product_ID column might contain the following values: SAP|37285, CRM|39101, and SAP|17195. This table then becomes input to the dimension and fact managers.
Durable keys are mandatory for dealing with ambiguities in the source system natural keys. Creating and assigning durable keys allows you to work around business changes to the natural keys, or to integrate duplicate or disparate data. But durable keys are just the start; there is a lot more to deduplicating and data integration.