Design Tip #149 Facing the Re-Keying Crisis
October 2, 2012
© Kimball Group. All rights reserved.
Have you been updating type 1 dimension attributes to reflect current values, only to have the business declare that accurate historical context must be preserved, including retroactively for data already loaded into the data warehouse? How do you tackle re-keying an existing dimensional model?
Imagine you’re a bank with a mature, dimensional data warehouse. Your current customer dimension has an attribute for credit score, which is initially populated when the customer begins their relationship with your bank. Credit score is managed as type 1 (restate history by updating in place), and updates trickle in sporadically as customers apply for new loans or other credit products.
There are some problems with this design that gradually turn into a crisis. Analysts can’t tell whether the credit score is up to date or 10 years old, because it’s managed as type 1. Users point out they sometimes want to know the original credit score, and other times they want to know the current credit score, and some of them even want a history of customers’ movement from one credit score band to another. The good news, if you can call it that, is that 5 years of detailed historical data is available.
A revised design for the customer dimension could include 3 new attributes:
- Original credit score (no longer type 1 but unchanged from the original application)
- Current credit score (type 1 updated quarterly from the new data source)
- Credit score band (poor, fair, good, excellent; type 2 updated quarterly from the new data source. Propagate a new row only if the customer moves from one band to another.)
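The band-change rule for the third attribute can be sketched in Python. The tip names the four bands but not their score boundaries, so the cutoffs below are purely illustrative, as are the function names:

```python
# Hypothetical cutoffs: the bands come from the tip, the boundaries do not.
BANDS = [(580, "poor"), (670, "fair"), (740, "good"), (float("inf"), "excellent")]

def credit_score_band(score):
    """Map a numeric credit score to its band (illustrative boundaries)."""
    for upper, band in BANDS:
        if score < upper:
            return band

def needs_new_type2_row(current_band, new_score):
    """Each quarter, propagate a new type 2 dimension row only when the
    customer's score has crossed into a different band."""
    return credit_score_band(new_score) != current_band
```

A quarterly score drift from 700 to 720 stays within "good" and produces no new row; a move from 700 to 760 crosses into "excellent" and does.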
Populating the dimension isn’t trivial. The current credit score is easy, because you have a nice, clean feed from the credit history source. The original credit score could be surprisingly difficult, because most of the bank’s systems have overwritten that score, but some sleuthing may uncover an adequate source.
Obtaining the data for the credit score band is easy, but integrating that information into the customer dimension is challenging. You don’t want to propagate a new row for each customer every quarter, but only if the customer moved from one credit score band to another. And you need to interweave these new rows propagated by credit score band changes with the existing type 2 customer dimension rows, potentially making many adjustments to the begin and end effective dates in the dimension records.
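The interweaving step, splitting whichever existing type 2 row spans the band-change date and adjusting its end effective date, can be sketched as follows. The row layout (customer_key, begin_date, end_date, band) is an assumption for illustration, not a prescribed schema:

```python
from datetime import date, timedelta

def insert_band_change(history, change_date, new_band, next_key):
    """Splice a band-change row into an existing type 2 customer history.

    history: one customer's dimension rows, sorted by begin_date, each a
    dict with customer_key, begin_date, end_date, band (plus other
    attributes).  The row spanning change_date is closed out the day
    before, and a new row carrying the new band begins on change_date.
    """
    for i, row in enumerate(history):
        if row["begin_date"] <= change_date <= row["end_date"]:
            # Copy the spanning row first, so it inherits the old end date.
            new_row = dict(row, customer_key=next_key,
                           begin_date=change_date, band=new_band)
            row["end_date"] = change_date - timedelta(days=1)
            history.insert(i + 1, new_row)
            return history
    raise ValueError("change_date falls outside the known history")
```

In a real ETL job this logic runs per customer per quarter, interleaved with the type 2 rows already created by other attribute changes.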
Finally we come to the problem that is the heart of this Design Tip: all the fact tables that refer to the customer dimension need to be altered. You’ve rebuilt your dimension table, adding new rows for historical changes in credit score band, but the production fact tables have keys pointing to the old customer dimension. All the keys are different now. And remember that most fact tables have orders of magnitude more rows than dimension tables, so any UPDATE operation is extremely unattractive.
Rather than updating a fact table in place, we create a new empty fact table with exactly the same structure as the old fact table. The only thing that’s different is the contents of the customer key column. Intellectually the problem is simple: write a query that joins the old fact table to the new dimension table on the transaction system account number and transaction date, picking up the new surrogate key. You’re just pouring water from a full bucket to an identical empty bucket. You may be surprised at how fast your RDBMS can perform this operation: loading an empty, un-indexed table is extremely fast.
Best practice is to have a test system that’s as identical as possible to the production system. Re-key on the test system, which may take several days, then run incremental loads to catch up to real time. When the weekend comes, back up the test system and restore it to production. If that’s not feasible, you’ll have to do a table-by-table truncate and a simple unload/reload from test to production.
If your data warehouse is currently under development, or in production but small in scope, it makes sense to build an ETL job to re-key each fact table. As already described, the target fact table structure is unchanged; you simply need a job to efficiently load data into an empty table in the test system, using a query that joins the fact table to the dimension table being changed. And of course you need a procedure to test the results and move the new data into production. Moving forward, the exit criteria for releasing a new fact table should include a template job to re-key that fact table.
If your data warehouse is in production and large – dozens or even hundreds of fact tables – it will seem overwhelming to develop a custom ETL job for each fact table. It would be much more cost effective to figure out how to automate the process, by writing a single template job that can be extended to any fact table via metadata or system catalogs. Remember that a fact table can join to a dimension table multiple times. Finally, most data warehouses have one or two fact tables that are much larger than the rest. You will probably develop a custom solution for your very largest fact tables, even if you automate most of the work.
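One way to automate this is a small generator that emits the re-keying INSERT ... SELECT for any fact table from metadata: the table name, its non-key columns, its date column, and the mapping of each surrogate-key column to the natural-key column it joins on (covering the case where a fact table joins the customer dimension more than once). A minimal sketch, with all table and column names assumed for illustration:

```python
def build_rekey_sql(fact_table, fact_cols, date_col, dim_table, roles):
    """Emit the INSERT ... SELECT that pours an old fact table into its
    empty re-keyed copy.

    roles: dict mapping each surrogate-key column in the fact table to
    the natural-key column used to join the dimension; one entry per
    role, since a fact table can join the same dimension multiple times.
    """
    selects = [f"f.{c}" for c in fact_cols]
    joins = []
    for i, (key_col, natural_col) in enumerate(roles.items()):
        alias = f"d{i}"
        selects.append(f"{alias}.customer_key AS {key_col}")
        joins.append(
            f"JOIN {dim_table} {alias} ON {alias}.account_number = f.{natural_col}"
            f" AND f.{date_col} BETWEEN {alias}.begin_date AND {alias}.end_date"
        )
    return (
        f"INSERT INTO {fact_table}_new\n"
        f"SELECT {', '.join(selects)}\n"
        f"FROM {fact_table} f\n" + "\n".join(joins)
    )
```

Driven by a metadata table or the system catalog, the same function covers every fact table; only the very largest ones are likely to need hand-tuned variants.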
Quite frequently, we see the DW team decide to change an attribute from type 1 to type 2, but only track the type 2 changes moving forward. In effect, the DW team is telling the business users that they need to wait months or years before they can do the historical analysis they’re interested in. Or, the business users need to pull down a bunch of data locally, building their own local mart that tracks history the way they want – not a solution that we want to encourage!
Of course, in an ideal world the DW team would make copious use of type 2 attributes so this re-keying crisis never occurs. It can be seductive to think “we can make it a type 2 in the future,” but as this Design Tip shows, doing it in the future is much harder than getting it right at the beginning.