It’s Time for Time
The importance of the time dimension in data marts and data warehouses.
The time dimension is a unique and powerful dimension in every data mart and enterprise data warehouse. Although one of the tenets of dimensional modeling is that all dimensions are created equal, the truth is that the time dimension is very special and must be treated differently from the other dimensions. This month, I’ll look at some issues involving the time dimension.
Basic Time Issues
Virtually every data mart is a time series. In a dimensional design, the fact table is the central table with a multipart key that records numeric measurements of the data mart’s process. Figure 1 is the familiar basic dimensional design I have shown many times in this column. In this example, the fact table records daily orders received by a manufacturing company. The first dimension, in the upper left corner, is the time dimension. In this case the time dimension designates calendar days.
Data architects in my design class ask, “Why can’t I just leave out the time dimension? The foreign key in the fact table can easily be an SQL date value, and I can just constrain on calendar periods through the mechanisms of standard SQL. This way I avoid an expensive join!” They also ask, “If I have to have a time dimension, where do I get it?”
The first question is answered by recalling the fundamental reason for wanting dimension tables in a data mart. The dimension tables serve as the source of constraints and as the source of report row headers. A data mart is only as good as its dimension tables. Figure 2 shows a recommended data mart time table expressed at a daily grain. If you don’t have lots of good descriptive attributes in your dimensions, then you have a crippled data mart. You won’t be able to constrain on inadequate dimensions, and you won’t be able to construct the reports you want. Although it is true in the case of the time dimension that SQL provides some minimal assistance in navigating dates, standard SQL functionality isn’t nearly complete enough to support the needs of a typical organization. SQL certainly doesn’t know anything about your corporate calendar, your fiscal periods, or your seasons. It is so easy to add these attributes to a time table that the data mart architect should never consider having this calendar logic embedded in end users’ applications.
The second data architect question is answered by saying, “Build it in a spreadsheet.” Unlike nearly any other dimension in a data mart, the time dimension can be built once on the architect’s PC and then uploaded into all of the data mart machines making up the overall enterprise data warehouse. The time dimension in Figure 2 can be built in half a day of fussing with spreadsheet cells and studying the official corporate calendar. It would be reasonable for this time dimension to contain at least 10 or 20 years worth of days, perhaps extending from 1990 to 2010.
Some data marts additionally track time of day to the nearest minute or even the nearest second. For these cases I recommend separating the time of day measure out as a separate “numeric fact.” It should not be combined into one key with the calendar day dimension. This would make for an impossibly large time table.
Intermediate Time Issues
Time is a dimension that invites incompatible rollups; the most obvious are weeks and months. Separate fact tables denominated in weeks and months should be avoided at all costs. Weekly and monthly data is forever incompatible. In last month’s column, I stated that one of the central data warehouse team’s responsibilities was to make sure that all of the dimensions used by separate divisional data marts were conformed. One of the most important conformation steps is to make sure that a uniform time rollup is used in all of the separate data marts.
The most flexible base on which to start is the daily time grain. Daily data rolls up to virtually every possible calendar. If your company operates on calendar month reporting, then you should organize around days, months, quarters, and years. If your company operates on artificial fiscal periods that roll up from weeks, such as a five-week, four-week, four-week quarter, then you should organize around days, weeks, fiscal periods, quarters, and years. It is okay to overlay incompatible seasonal periods based on specific spans of days onto either one of these schemes, but seasonal periods such as Christmas or Easter will not roll up to any of the other calendar periods. Adding seasonal interpretations to the calendar is easy if you start at a daily time grain. This is one of the advantages of the dimensional approach. It is very robust in the face of changing business requirements, as long as you start at the lowest atomic grain of data — in this case daily.
Another issue faced by every data mart designer is what to do about “slowly changing dimensions.” A data mart designer might be tempted to model a slowly changing dimension such as Product as a kind of fact table with a time key in addition to the original product key, as shown in Figure 3. This is definitely not recommended. In most cases, any constraint placed on the secondary time dimension connected to the Product table is different from the constraint placed on the primary time dimension connected to the fact table. The meaning of the time stamp in the product table has to do with a revision date for that version of the product. The meaning of the time key in the fact table has to do with when the activity in the fact table record occurred. These two times may be uncorrelated. For example, a product stamped with a version date of January 1 may be sold in a store on February 15. The product in question may even have been superseded by a new version of the product with a version stamp of February 1. The moral of this story is: Don’t get tangled up with multiple time dimensions hanging off every regular dimension table. It is tempting to tie all of these time constraints together in the name of some kind of “temporal referential integrity,” but this approach makes the application very complex and is semantically incorrect.
Another intermediate level design issue involving time that I have mentioned in the past is the very common issue of aggregating nonadditive measures of intensity across time. The most common examples are inventory levels and account balances. The problem is that in most cases you want to aggregate these measures across time by computing the “average over time.” This is not the same calculation as SQL avg, which is the average over all of the records returned to the answer set for a particular SQL select statement. The end-user application designer must calculate “average over time” by first determining the cardinality of the constraint on the time dimension by itself, saving this number away, then adding the intensity measure across time, and finally dividing out by the time cardinality that has been saved away. Although this is trivial mathematics, it is a significant pain in a standard SQL environment that has no avgperiodsum operator to handle this case automatically. Standard ad hoc query tools that emit single SQL select statements simply cannot perform “average daily balance” calculations, although sophisticated application development environments know how to do it.
A final intermediate-level design issue involving time is whether to design a data mart around individual transactions or around month-end snapshots. In businesses that are transaction-intensive, such as insurance, this can be a real dilemma. Insurance claims-processing transactions are extremely interesting and have a very rich texture. Many complex questions in insurance can only be answered by having access to individual transactions. For instance, what is the average length of time between the original claim and the first payment to the claimant? On the other hand, the record of the individual transactions is a poor basis for creating a month-end report for management. To create such a report, an application would have to go back to some starting point in time and sequentially process all the intervening transactions. Not only is this very inefficient, but it isn’t a SQL query. It is some kind of complex procedural application.
The data mart designer in a transaction-rich business such as insurance will eventually have to build two versions of the data mart: a transaction version and a monthly snapshot version. In this way, the two kinds of analysis can each be accomplished with simple applications. Also, in many cases, snapshots at intermediate points in time (between month ends) can be constructed by starting with the previous snapshot and playing just the current month’s transactions forward to the desired point in time.
Advanced Time Issues
Although there are many interesting advanced time-processing issues, I want to describe two applications that I have seen repeatedly in data warehouse environments. The first is “time alignment of similar events.” For instance, if you have a large data mart of customer purchases, you may be interested in asking questions about a group of customers that are defined by some event. Perhaps the customers are defined by having their credit limit raised to $1,000. Once this triggering event occurs, you want to study the behavior of this cohort group. You want to measure their purchases and their payments as a function of the time after the granting of credit. You may want to ask what is the average time until these customers have a credit default (if they do).
To ask these kinds of questions, you must first identify the customers in question and then align their purchase time histories relative to the triggering event. A very simple approach would be to add an additional time dimension to their purchase history. This time dimension would be aligned with the triggering event. Although this is logically simple, it is impractical. There may be many different analyses being performed on the customer purchase file, and each analysis cannot have its own time dimension. In most cases, the application designer is forced to extract the time history of each customer in the “study” into a new private fact table and keep that fact table on the disk physically for the duration of the analysis, which could be days, weeks, or months. To make this practical, the application designers and analysts need a comprehensive application that creates, manages, and queries these tables. All of the systems like this I have seen have been built from scratch in Visual Basic or PowerBuilder. I know of no application development or decision-support environment that handles this kind of application automatically.
The final advanced time application I have seen repeatedly is the “progressive subsetting” application. This is a more complex version of the previous example. Progressive subsetting is popular in medical outcome analysis. The medical researcher may have a comprehensive data mart of patient diagnoses, treatments, and outcome events. The researcher wants to ask a series of questions that gradually subset an original patient population down to a small final set of patients who have experienced a particular sequence of events. For instance, the researcher might want all of the patients who originally complained of chest pain, and who were treated within one month of the complaint with either drug A or drug B, and who did not subsequently have surgery, and who did have pain three months later, and who are still alive today. Not only does this application need to align similar events as in the preceding example, but the analyst needs powerful query logic for finding patients where events did not happen, and for combining groups of patients with set logic including union, intersection, and set difference. Again, the only systems I know about that tackle these issues seriously are home-grown systems, typically written in Visual Basic or PowerBuilder.
I have tried to provide some perspective on the main time issues confronting the data mart designer and how to tackle each of them. Although the dimension table for time is one of the simplest and most obvious tables in any data mart, the surrounding design issues for handling time are very challenging. In my classes, when I ask the students, “What is the first dimension we might choose for this particular data mart?” they soon learn to yell in unison: “TIME!”