Or does it have more to do with sequential processing?
The data staging area is the data warehouse workbench. It is the place where raw data is brought in, cleaned, combined, archived, and eventually exported to one or more data marts. The purpose of the data staging area is to get data ready for loading into a presentation server (a relational DBMS or an OLAP engine). We assume that the data staging area is not a query service. In other words, any database that is used for querying is assumed to be physically downstream from the data staging area.
Perhaps you donýt even realize you have a data staging area. Maybe your data just does a “touch and go” landing in between the legacy system and the presentation server. (Thatýs an airplane metaphor.) You bring the data in briefly, assign a surrogate key, check the records for consistency, and send them on to the DBMS loader that is the presentation database.
If the legacy data is already available in a relational database, then it may make sense to perform all the processing steps within the relational framework, especially if the source relational database and the eventual target presentation database are from the same vendor. This makes even more sense when the source database and the target database are on the same physical machine, or when there is a convenient high-speed link between them.
However, there are many variations on this theme, and in many cases it may not make sense to load the source data into a relational database. In the detailed descriptions of the processing steps, we will see that almost all the processing consists of sorting, followed by a single, sequential pass through either one or two tables. This simple processing paradigm does not need the power of a relational DBMS. In fact, in some cases, it may be a serious mistake to divert resources into loading the data into a relational database when what is needed is sequential flat-file processing.
Similarly, we will see that if the raw data is not in a normalized entity-relationship (ER) format, in many cases it does not pay to load it into an ER physical model simply to check data relationships. The most important data integrity steps involving the enforcement of one-to-one and one-to-many relationships can be performed, once again, with simple sorting and sequential processing. Keeping these thoughts in mind, letýs tease apart as many of the data transformation steps as we can.
In order to enforce referential integrity, we almost always process the dimensions before processing the facts. We take the point of view that we always create a surrogate key for a dimension record before loading it into the final data mart. We will see in the next section that the identification of the correct surrogate key is a simple sequential file lookup.
A significant dimension-processing problem is deciding what to do with a dimension description such as Customer or Product that does not agree with what is already stored in the data warehouse. If the revised description is taken to be a legitimate and reliable update to previous information, then you must use the techniques of slowly changing dimensions. (See “Slowly Changing Dimensions,” April 1996.) If the revised description is merely an informal entry, then the changed fields are ignored.
We need to recognize what has changed in the input data and generate the correct surrogate dimension key. Every data warehouse key should be a surrogate key because the data warehouse DBA must have the flexibility to respond to changing descriptions and abnormal conditions in the raw data. If the actual physical join key between a dimension table and a fact table is a direct derivation of a production key, sooner or later the DBA is going to face an impossible situation. Production keys can be reused or reformatted. Sometimes the dimension value itself must be “unknown.” The most common need for a generalized key is when the data warehouse attempts to track a revised dimensional description and the production key has not been changed.
Deciding What Has Changed
The first step in preparing a dimension record is to decide if we already have the record. The raw data will usually have a production key value. This production key value must be matched to the same field in the “current” dimension record. Note that in the data warehouse dimension this is just an ordinary dimensional attribute. The current dimension record can be found quickly with an updatable flag in the dimension record that identifies whether or not that record is the current one.
This matching of the incoming production key to its partner in the current dimension record can be accomplished with a simple sequential pass over the raw data and the dimension table data, both sorted on the production key. If the incoming dimensional information matches what we already have, then no further action is required. If anything in the incoming dimensional information has changed, then we must apply a type-1, type-2, or type-3 change to the dimension.
- Type 1: Overwrite. We take the revised description in the raw data and overwrite the dimension table contents. For instance, we may receive a corrected customer address. In this case, overwriting is probably the right choice.
- Type 2: Create a new dimension record. We take the previous version of the dimension record, if there is one, and copy it, creating a new dimension record with a new surrogate key. If there is no previous version of the dimension record, then we create a new one from scratch. We then update this record with those fields that have changed and any other fields that are needed. For instance, we may receive a customer record that shows that the customerýs marital status has changed. If we believe this is an actual change, rather than a correction, then we should use this type. The new surrogate key is probably the next sequential integer that represents max(surrogatekey)+1 for the dimension under consideration. In order to speed processing, the max(surrogatekey) value can be explicitly stored as metadata rather than computed each time it is needed.
- Type 3: Push down the changed value into an “old” attribute field. In this case, we have anticipated that an attribute may experience “soft” changes that require a user to refer either to the old value of the attribute or the new value. For example, if a sales team is assigned to a newly named sales region, there may be a need to track the team alternatively in the old region assignment as well as in the new region assignment.
Regardless of which type change we make in processing the incoming dimension data, our choice will always devolve into a single pass sequential file processing step.
Combining from Separate Sources
Complex dimensions are usually derived from several sources. We may need to merge customer information from several lines of business and from outside sources. There will usually not be a universal key that makes this merge operation easy. The raw data and the existing dimension table data may need to be sorted at different times on different fields in order to attempt a match. Sometimes a match may be based on fuzzy criteria; names and addresses may match except for minor spelling differences. In these cases, a sequential processing paradigm is more natural than an equijoin in a relational database. Fuzzy matches are not directly supported in a relational context in any case.
Another common merging task in data preparation is looking up text equivalents for production codes. In many cases, the text equivalents are sourced informally from a nonproduction source. Again, the task of adding the text equivalents can be done in a single pass by first sorting both the raw data and the text lookup table by the production code.
Data cleaning may also involve checking the spelling of an attribute or checking the membership of an attribute in a list. Once again, this is best accomplished by sorting the raw data and the permissible target values and processing them in a single comparison pass.
A useful data quality check, even when there isnýt a defined membership list, is simply sorting the raw data on each text attribute; minor variations in spelling or punctuation will jump out. A sorted list can easily be made into a frequency count. The spellings with low frequencies can be checked, if necessary by hand, and corrected.
Processing Names and Addresses
Incoming name and address fields are often packed into a small number of generic fields such as Address1, Address2, and Address3. The contents of these generic fields need to be parsed and separated into all their constituents. Once names and addresses have been cleaned and put into standardized formats, they can be more easily deduplicated. What appears to be two customers at first turns out to be one. Perhaps one has a P.O. box and the other has a street address, but the rest of the data clearly reveals that it is the same customer. This kind of duplication is best seen when the customer file is sorted repeatedly by different attributes. I prefer sorting with a dedicated sort package such as Syncsort rather than attempting to do it in a relational database.
A more powerful form of deduplicating is householding. In this case, an “economic unit” consisting of several individuals is linked under a single household identifier. The most common case would be a husband and wife who have various single and joint accounts with various slight differences in name spellings and addresses.
Validating One-to-One and One-to-Many Relationships
If two attributes in a dimension are supposed to have a one-to-one relationship, you can check it easily by sorting the dimension records on one of the attributes. A sequential scan of the data will show whether any violations exist. Each attribute value in the sorted column must have exactly one value in the other column. The check must then be reversed by sorting on the second column and repeating. Note that you donýt put this data into an ER schema to enforce the one-to-one mapping. You would have to fix any problems before you loaded the data, which is the whole point of this discussion.
A one-to-many relationship can similarly be verified by sorting on the “many” attribute and verifying that each value has only one value on the “one” attribute.
Like most of the other processing steps, the paradigm is to first sort and then scan the files sequentially exactly once.
The incoming fact records will have production keys, not data warehouse keys. The current correct correspondence between a production key and the data warehouse key must be looked up at load time. The fast lookup of surrogate keys is facilitated by keeping a two-column table that maps all incoming production IDs to current data warehouse surrogate keys. If the fast lookup table can be kept in memory during the key-replacement process, then the fact table may not need to be sorted. If not, the quickest procedure would be to sort the incoming fact records in turn on each production ID and then perform the lookups of the surrogate keys in a single sequential scan.
Each load of new fact records requires that aggregates be calculated or augmented. It is very important to keep the aggregates synchronous with the base data at every instant. If the base data is updated and placed online but a delay ensues before the aggregates are updated, then the aggregates must be taken offline until they are ready. Otherwise, the aggregates will not correctly mirror the base data. In this situation, the DBA has a choice between delaying the entire publishing of the data until both the base data and the aggregates are ready, or releasing the revised base data in a performance-degraded mode while the aggregates are being updated offline.
The creation of aggregates is equivalent to the creation of break rows in a report. If aggregates representing product-category totals are needed, then the incoming data must be sorted by product. Actually, several passes may be required to produce all the category aggregates. If the base fact table data has a granularity of product by store by day, we might want to produce the following aggregates:
- Category by store by day
- Category by region by day
- Category by store by month
- Category by region by month.
If we are loading data every day, then for each day, we can create the first two aggregates entirely from the daily load. Two sorts of the data are required: the first is store/product with products rolling up to categories within each store; the second is region/product with products rolling up to categories within each region.
The first sort can also be used to augment the category-by-store-by-month aggregate by replacing the day surrogate key with the proper month surrogate key and presenting this record to the category-by-store-by-month fact table. By “augment,” we mean that the records generated each day either cause an insert or an update in the monthly aggregate table. We can’t tell whether an insert or an update is required until we present the new aggregate record to the fact table. If a record with the same key is already present, then we do an update, adding the appropriate facts to those already in the target record. If the record with the same key is not in the fact table, then we perform an insert. It is highly desirable to have a DBMS loader that can perform these augmenting loads in a bulk mode. If a sequential program has to check the database record by record in order to decide whether to perform an update or an insert, the load process will run far more slowly.
The Bottom Line: Is Data Staging Relational?
Most data staging activities are actually not relational, but rather they are sequential processing. If your incoming data is in a flat-file format, then you should finish your data staging processes as flat files before loading it into a relational database.
Don’t be fooled by the ability to sequentially process a relational table with a relational cursor and a programming language such as Oracle’s PL*SQL. Perhaps all you have done is turn your relational database into a kind of flat file. This is a little like driving an army tank down to the corner grocery store. You will be stunned by the speed of flat-file sorting and sequential processing.