|
Data warehouse designers take an implicit oath to represent the past correctly. We promise our users that
if the definition of a customer or product changes slowly over time, we will be very careful to keep the
old definitions in the database and apply them to history correctly. If we work for an insurance company
and we are looking backward in time to see why we approved insurance for a particular cluster of
customers, we must have the correct descriptions of those customers at the moments in the past when we
approved their insurance, not the descriptions that are valid in the present. These customers are now
older and presumably wealthier, wiser, and have larger families. We dont want these revised
descriptions when we try to evaluate an old decision.
Data warehouse designers have long recognized the need for describing the past accurately. In the
dimensional data warehouse, the world divides into fact tables and dimension tables. Fact table data, by
its nature, represents a time series of measurements and is always augmented with an explicit time
dimension. Finding old fact-table data is easy and is one of the standard queries in the data warehouse.
Just constrain the time dimension to the appropriate previous time interval.
But dimension-table data requires more thought. Dimensions dont change in predictable ways.
Individual customers and products evolve slowly and episodically. Some of the changes are true physical
changes. Customers change their addresses because they move. A product is manufactured with different
packaging. Other changes are actually corrections of mistakes in the data. And finally, some changes are
changes in how we label a product or customer and are more a matter of opinion than physical reality. We
call all these variations slowly changing dimensions (SCDs). Over the last decade, data
warehouse designers have sorted out three major approaches to SCDs. We call these Type 1, Type 2, and
Type 3. Briefly: A Type 1 SCD is an overwrite of a dimensional attribute. History is definitely lost.
We overwrite when we are correcting an error in the data or when we truly dont want to save
history.
A Type 2 SCD creates a new dimension record and requires a generalized or surrogate key for the
dimension. We create surrogate keys when a true physical change occurs in a dimension entity at a
specific point in time, such as the customer address change or the product packing change. We often add a
timestamp and a reason code in the dimension record to precisely describe the change. A Type 3 SCD adds
a new field in the dimension record but does not create a new record. We add a new field when we have a
new label for the customer or the product that is a matter of opinion rather than physical reality. Maybe
we change the designation of the customers sales territory because we redraw the sales territory
map, or we arbitrarily change the category of the product from Confectionery to Candy. In both cases, we
augment the original dimension attribute with an old attribute so we can switch between these
alternate realities. Type 3 is distinguished from Type 2 because we could regard both the old and new
descriptions of a Type 3 change as true simultaneously. In the product category example, the addition of
an Old Category field gives us exactly one alternate category reality to switch to. These three types
of slowly changing dimensions handle most of the situations faced by the data warehouse designer. But a
few weird situations remain. Several times in my design consulting I have encountered a situation that
seems to be a hybrid of Type 2 and Type 3. I call this situation many alternate realities.
Predictable Multiple Realities
Consider a situation where a sales organization continually revises the map of its sales districts.
Perhaps it adjusts its sales districts each year to try to adapt to changing market conditions. Over a
10-year period the sales organization accumulates no fewer than 10 different maps. On the surface, this
organization would seem like a candidate for the Type 2 slowly changing dimension. But as the database
designer, you discover during the user interviews that this sales organization has a much more complex
set of requirements. It wants to: Report each years sales using the approved district map
for that year Report each years sales using a district map from an arbitrary different
year Report an arbitrary span of years sales using a single district map from any chosen
year. The most common version of the third requirement would be to report the complete span of years
using todays map. You cannot serve this set of requirements with the Type 2 model because Type 2
perfectly partitions history and a year can only be reported using its assigned unique map. The
requirements cannot be met with a Type 3 model because Type 3 only allows a single alternate
reality, and in this case, we have 10 alternate realities. In this example, we can take advantage
of the regular nature of these 10 alternate realities by generalizing the Type 3 model to have not one,
but 10, versions of the District attribute for each sales team. The sales team dimension would then look
something like: Sales Team Key Sales Team Name Sales Team Physical Address (stays
constant) District1999 (the district assignment for the team in 1999) District1998 (the district
assignment for the team in
1998) District1997 District1996 District1995 District1994 District1993 District1992 Distric
t1991 District1990
plus other unrelated sales team attributes. Each sales team record would have all 10 district
interpretations, and the end user could choose to roll up all sales teams with any of the 10 district
maps. In this design, there is one record for each sales team, and the addition of more realities merely
requires the addition of more district attributes in the original records.
Unpredictable Multiple Realities
Now lets make the problem harder. Suppose that the assignment of each sales team to a district is
not synchronized to calendar years but occurs at random and unpredictable times, and each sales team is
different. Now we will restate the reporting requirements a little. The users want to: Report
district sales at any past instant in time using the assignments that were valid at that
instant Report all sales over all time using whatever todays district map may
be Report all sales over all time using a selected obsolete district map. This design requires
a hybrid of Type 2 and Type 3. We issue a new record for a sales team whenever its district assignment
changes, but we also carry along a Current District attribute in all the versions of each
sales team record that is overwritten whenever the current district map is altered. The sales team record
now looks like: Sales Team Key Sales Team Name
Sales Physical Address (remains unchanged) District (the district assignment valid between the
following dates) Begin Effective Date (the first date this record is valid) End Effective Date (the
last date this record is valid) Obsolete District1 (a selected obsolete definition) Obsolete
District2 (a different obsolete definition)
Current District (the most current district assignment; periodically overwritten)
plus other unrelated sales team attributes.
We administer this design differently. When a sales team is deemed to be part of a new district, we issue
a new record for that sales team. This is a straightforward Type 2 SCD response. We bookkeep the Begin
and End dates correctly. We assume here that the End Effective Date for one assignment of a district is
exactly one day less than the Begin Effective Date of the next assignment. We keep the same value in a
given obsolete definition across all the records that describe a specific sales team. Finally, we also
sweep back through all previous instances of this sales teams records and overwrite the Current
District attribute. The Obsolete and Current attributes are a variation of the Type 3 SCD. Now we can
meet all of the reporting objectives given to us by the end users. When we report all district sales at a
particular instant in time, we must constrain the query in the following way:
Reporting_date >= Begin_effective_date and Reporting_date <= End_effective_date.
You might have been tempted to write:
Reporting_date BETWEEN Begin_ effective_date and End_effective_date
but amazingly, this code is not SQL! SQL allows field-between-values, not
value-between-fields. Fortunately, Oracle and other DBMSs do support nonstandard extensions
to SQL to let you use this intuitive construction without getting scolded by the syntax checker.
Complex Political and Geographic Dimensions
Ive described this last example in terms of a sales organization in order to make the explanation
as clear as possible. A few sales organizations actually are managed this way, especially when the
definition of sales districts are confounded with the definition of sales channels. An interesting
real-world example of this hybrid Type 2/3 dimension is a dimension describing geographic entities over a
period of time. Imagine a map of Europe dating from about 1500. On this map we identify hundreds of small
city states, districts, and portions of countries. Each of these small geographic entities is made into a
record. Then we administer this geographic entity dimension in the same way as the last sales
organization example. This way, we can assemble the little geographic entities into countries at any
point in time from 1500 to the year 2000. By constraining to a particular reporting date, we can
instantly snap in a logical map of Europe at that time. By using the Obsolete fields, we can track the
Hapsburg Empire from its origins in 1500 to the modern day, even though it does not exist today. By using
the Current field, we can track the contemporary definition of the European Union similarly. We can
even augment the geographic entity record with a field that links either to a simple JPEG file that is
the correct Europe map, or links to a geograhic information system (GIS) representation of that same map.
To make this work best, we would need a contiguous set of maps spanning 1500 to 2000 for all the
geopolitical changes that occurred.
This kind of geographic dimension is personally fascinating to me because of the complexity of
representing the complicated changes in the definitions of states and countries over the period from 1500
to 2000. If this interests you as well, then check out a wonderful atlas devoted to this perspective with
dozens of historical maps called The Times Atlas of World History (Hammond, 1989).
Ralph Kimball, Ph.D., co-invented the Star Workstation at Xerox and founder of Red Brick
Systems, works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit (Wiley, 1996) and the newly published The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him
through his Web page at www.ralphkimball.com.
|