DBMS - August 1996 - Aggregate Navigation With (Almost) No Metadata

Data Warehouse Architect By Ralph Kimball

Aggregate Navigation With (Almost) No Metadata

By Ralph Kimball
DBMS Data Warehouse Supplement, August 1996

Design Requirements For Aggregate Environments.


The single most dramatic way to affect performance in a large data warehouse is to provide a proper set of aggregate (summary) records that coexist with the primary base records. Aggregates can have a very significant effect on performance, in some cases speeding queries by a factor of one hundred or even one thousand. No other means exist to harvest such spectacular gains. Certainly the IS owners of a data warehouse should exhaust the potential for performance gains before investing in new hardware. The benefits of a comprehensive aggregate building program can be realized with almost every data warehouse hardware and software configuration, including all of the popular relational DBMSs such as Oracle, Red Brick, Informix, Sybase, and DB2, as well as uniprocessor, SMP, and MPP architectures. I describe how to structure a data warehouse to maximize the benefits of aggregates, and how to build and use those aggregates without requiring complex accompanying metadata.

The basics of aggregate navigation were discussed in detail in my November 1995 column. The main points were:

  • In a properly designed data warehouse environment, multiple sets of aggregates are built, representing common grouping levels within the key dimensions of the data warehouse. I assume a standard dimensional database structure (commonly called a star join schema) exists in which a large central fact table is surrounded by a single level of independent dimension tables. This type of schema is illustrated in Figure 1.
  • An aggregate navigator is a piece of middleware that sits between the requesting client and the DBMS.
  • An aggregate navigator intercepts the client's SQL and, wherever possible, transforms base-level SQL into aggregate-aware SQL.
  • An aggregate navigator understands how to transform base-level SQL into aggregate-aware SQL, because the navigator uses special metadata that describes the data warehouse aggregate portfolio.

    High-Level Goals and Risks

    The goal of an aggregate program in a large data warehouse must be more than just improving performance. A good aggregate program should:

  • provide dramatic performance gains for as many categories of user queries as possible
  • add only a reasonable amount of extra data storage to the warehouse -- "reasonable" is in the eyes of the DBA, but many data warehouse DBAs strive to increase the overall disk storage for the data warehouse by a factor of two or less
  • be completely transparent to end users and to application designers except for the obvious performance benefits; in other words, no end-user application SQL should reference the aggregates directly
  • benefit all users of the data warehouse, regardless of which query tool they use
  • impact the cost of the data extract system as little as possible; it is inevitable that a lot of aggregates must be built every time data is loaded, but the specification of these aggregates should be as automated as possible
  • impact the DBA's administrative responsibilities as little as possible; the metadata supporting aggregates should be very limited and easy to maintain

    A well-designed aggregate environment can achieve all of these objectives. A poorly designed aggregate environment can fail all of the objectives. In the remainder of this article, I provide four design requirements that, if followed, will achieve all of your desired objectives.

    Design Requirement #1

    Aggregates must be stored in their own fact tables, separate from the base-level data. In addition, each distinct aggregation level must occupy its own unique fact table.

    The separation of aggregates into their own fact tables is very important. First, the aggregate navigation scheme I describe is simpler when the aggregates occupy their own tables, because the aggregate navigator can learn almost everything it needs from the DBMS's ordinary system catalog, rather than requiring additional metadata. Second, an end user is less likely to double-count additive fact totals accidentally when the aggregates are in separate tables because every query against a given fact table will by definition go against data of a uniform granularity. Third, the small number of giant numerical entries representing, for instance, national sales totals for the entire year do not have to be shoehorned into the base table. Often the presence of these few giant numbers forces the database designer to increase the field widths of all entries in the database, thereby wasting disk storage. Because the base table is a huge table occupying perhaps half of the entire database, it is very helpful to keep its field widths as tight as possible. Fourth, the administration of aggregates is more modular and segmented when the aggregates occupy separate tables. Aggregates can be built at separate times, and, with an aggregate navigator, individual aggregates can be taken offline and placed back online throughout the day without impacting other data.

    Design Requirement #2

    The dimension tables attached to the aggregate fact tables must, wherever possible, be shrunken versions of the dimension tables associated with the base fact table.

    In other words, using the base-level Fact table in Figure 1, you might wish to build category level aggregates, representing the Product Dimension rolled up from the individual product to the category. (See Figure 2.)

    Notice that in this example you have not requested aggregates in either the Time Dimension or the Store Dimension. The central Sales Fact table in Figure 2 represents how much of a category of product was sold in each store each day. Your design requirement tells you that the original Product table must now be replaced with a shrunken product table, which you might as well call "category." A simple way to look at this shrunken product table is to think of it as containing only those fields that survive the aggregation from individual product up to the category level. Only a few fields will still be uniquely defined. For example, both the Category description and the Department description would be well defined at the category level, and these must have the same field names they have in the base Product Dimension table. However, the individual UPC number, the package size, and the flavor would not exist at this level, and must not appear in the Category table.

    Shrunken Dimension tables are extremely important for aggregate navigation, because the scope of any particular aggregation level can be determined by looking in the system catalog description of the shrunken table. In other words, when you look in the Category table, all you find is Category description and Department description. If a query asks for product flavor, you know immediately that this aggregation level cannot satisfy the query, and thus the aggregate navigator must look elsewhere.

    Shrunken Dimension tables are also attractive because they let you avoid filling the original Dimension tables with weird null values for all of the dimension attributes that are not applicable at the higher levels of aggregation. Because you don't have flavor and package size in the Category table, you don't have to dream up null values for these fields, and you don't have to encode user applications with tests for these null values.

    Although I have focused on shrunken Dimension tables, it is possible that the Fact table will also shrink as you build ever higher levels of aggregation. Most of the basic additive facts such as dollar sales, unit sales, and dollar cost will survive at all levels of aggregation, but some dimensions such as promotion and some facts such as promotion cost may only make sense at the base level and may need to be dropped in the aggregate Fact tables.

    Design Requirement #3

    The base Fact table and all of its related aggregate Fact tables must be associated together as a "family of schemas" so that the aggregate navigator knows which tables are related to one another.

    Any single schema in the family consists of a Fact table and its associated Dimension tables. There is always exactly one base schema that is the unaggregated data, and there will be one or more aggregate schemas, representing computed summary data. Figure 1 is a base schema and Figure 2 is one of perhaps many aggregate schemas in your family.

    The registration of this family of Fact tables, together with the associated full-size and shrunken Dimension tables, is the sole metadata needed in this design.

    Design Requirement #4

    Force all SQL created by any end user or application to refer exclusively to the base fact table and its associated full-size dimension tables.

    This design requirement pervades all user interfaces and all end-user applications. When users examine a graphical depiction of the database, they should only see the equivalent of Figure 1. They should not be aware that aggregate tables even exist. Similarly, all hand-coded SQL embedded in report writers or other complex applications should only reference the base fact table and its associated full-size dimension tables. In those environments in which ad hoc query tools let the end users see every table in the system, it is a good idea to place the aggregate tables in a separate database to hide them from end users. Because the aggregate navigator is maintaining its own connections to the DBMS, this should not present a technical problem.

    Aggregate Navigation Algorithm

    Assuming that you built your dimensional data warehouse according to these four design requirements, you are now in a position to understand how aggregate navigation works.

    The aggregate navigation algorithm is very simple. It consists of only three steps:

    1. For any given SQL statement presented to the DBMS, find the smallest fact table that has not yet been examined in the family of schemas referenced by the query. "Smallest" in this case means the least number of rows. Finding the smallest unexamined schema is a simple lookup in the aggregate navigator metadata table. Choose the smallest schema and proceed to step 2.

    2. Compare the table fields in the SQL statement to the table fields in the particular Fact and Dimension tables being examined. This is a series of lookups in the DBMS system catalog. If all of the fields in the SQL statement can be found in the Fact and Dimension tables being examined, alter the original SQL by simply substituting destination table names for original table names. No field names need to be changed. If any field in the SQL statement cannot be found in the current Fact and Dimension tables, then go back to step 1 and find the next larger Fact table. This process is guaranteed to terminate successfully, because eventually you arrive at the base schema, which is always guaranteed to satisfy the query.

    3. Run the altered SQL. It is guaranteed to return the correct answer because all of the fields in the SQL statement are present in the chosen schema.

    The beauty of this algorithm is that almost no metadata is required to support general navigation. The metadata amounts to only one row for each Fact table and Dimension table in the aggregate schemas. The maintenance of the metadata does not require complex logical modeling. Only the existence of the shrunken Fact and Dimension tables must be recorded.

    In actually implementing this algorithm, several points are worth noting. In step 2, only the shrunken Dimension tables and the Fact table must be examined. If a given schema uses a base-level Dimension table, then its fields do not need to be searched because a match is guaranteed. For example, in our aggregate schema shown in Figure 2, the Time and Store Dimension tables are not shrunken. Any SQL reference to a field in one of these tables does not need to be checked. Only references to fields in the Product table and the Fact table itself must be checked. In the case of Figure 2, the check will be successful only if the references to the Product table are restricted to the category description, the department description, or both. In a successful match, the final SQL differs from the original SQL only by the substitution of "category" for the table name, "product" for the dimension, and "sales_fact_ agg_by_category" for the table name "sales_fact." The original SQL is:

    select p.category, sum(f.dollar_sales), sum(f.dollar.cost)
    from sales_fact f, product p, time t, store s
    where f.product_key = p.product_key
      and f.time_key = t.time_key
      and f.store_key = s.store_key
      and p.category = 'Candy'
      and t.day_of_week = 'Saturday'
      and s.floorplan_type = 'Super Market'
    group by p.category
    

    The preceding query asks for the total dollar sales and total dollar cost of all candy sold in supermarket stores on Saturdays. The aggregate navigator scans this SQL and first looks at the aggregate "agg_by_all" in Figure 3 because that aggregate is the smallest, having only 200 Fact table rows. However, this aggregate fails because the constraints on time (day_of_week) and store (floorplan_type) both violate the agg_by_all schema. Day_ of_week cannot be found in the Month aggregate Dimension table, and floorplan_type cannot be found in the Region aggregate Dimension table.

    Then the aggregate navigator tries the next larger schema, namely "agg_by_category." This time it is successful. All of the fields in the SQL query match. In particular, the product references to "category" are found in the shrunken Product table named "Category." Now the aggregate navigator replaces the table references and produces the following SQL. Only the italicized items are different:

    select p.category, sum(f.dollar_sales), sum(f.dollar.cost)
    from sales_fact_agg_by_category f, category p, time t, store s
    where f.product_key = p.product_key
      and f.time_key = t.time_key
      and f.store_key = s.store_key
      and p.category = 'Candy'
      and t.day_of_week = 'Saturday'
      and s.floorplan_type = 'Super Market'group by p.category
    

    The most straightforward implementation of the aggregate navigation algorithm would decompose the SQL query and look up each field name in step 2 of the algorithm. Each such lookup would be a SQL call to the DBMS's system tables. This is not a crazy approach because such calls are quite efficient and should run in a few hundred milliseconds each. However, in a large and complex data warehouse environment, a practical consideration arises. Calls to the DBMS system tables may take several seconds each, rather than several hundred milliseconds. If six or eight layers of aggregate tables exist, the aggregate navigator may take 20 seconds to determine the correct choice. This is an example of snatching defeat from the jaws of victory.

    A better approach is to cache the system tables in the aggregate navigator so that the lookup of prospective field names does not require a SQL call to the DBMS. This approach is wonderful from a performance point of view, but it makes the aggregate navigator somewhat more difficult to design. First, the navigator must be able to read and store complicated system table configurations including, in the worst case, those with potentially thousands of fields scattered across hundreds of tables. Of course, you restrict your readout of the system tables to only those fields named in your aggregate navigator metadata table. Second, the navigator must have some idea of when to return to the real DBMS system tables to refresh its representation.

    The aggregate navigation algorithm has certain limitations. Note that I am assuming that Kimball's law for star join queries is being obeyed. Kimball's law states that "no star join query will ever mention more than one fact table in the from clause." The basis of this law is the presumption that any DBMS that attempts to join multiple huge fact tables together in a single query will lose control of performance. Fortunately, few, if any, end-user query tools let users easily combine multiple Fact tables. In the vast majority of cases, the query is a simple set of joins between a single Fact table and a suite of Dimension tables. So this is not a serious restriction.

    A second limitation is that each aggregation is "complete." For example, our Category aggregate table in Figure 2 contains summaries for every possible aggregate. The aggregate tables are not restricted to a subset of values. Suppose the complete list of categories had 10 names. Our Category table must always contain entries for all 10. You could not build a category table for just the categories "Snacks" and "Candy." It would be difficult to generalize the algorithm to handle a subset of values, because there is no obvious representation of a subset of values that could be quickly stored and that it could be compared against. For example, with Snacks and Candy, do you store the text names of the categories, their underlying key values, or a more complex criterion such as "in the Convenience Department, but not including Hardware"? And how would you handle very, very long lists with hundreds or even thousands of entries?

    My design approach has one major and unexpected benefit. As you ascend the various dimension hierarchies, you are likely to intersect the natural levels at which corporate planning and budgeting occur. For instance, in Figure 3, I show an aggregate Fact table that could easily arise from the aggregate construction process. This table shows sales at the category by region by month level. In Figure 3, all three of the dimensions are shrunken. Although this is a small table physically, it may be accessed frequently for high-level analyses. For completely coincidental reasons, it is likely that the corporate planning (or budgeting) process is producing plans at the same level, namely category by region by month.

    A goal of most data warehouses is to compare plans to actuals. This is very efficient if these comparisons are intra-record computations. When you notice the serendipitous correspondence between your aggregate Fact table at the category by region by month level and the planning table at the same level, take the opportunity to combine these two tables. You should add the italicized fields in Figure 3. At a certain high level of the aggregate hierarchy, planning (or budgeting) data "snaps in" and resides in the same table.

    The addition of planning or budgeting data at high levels of aggregation requires generalizing your algorithm somewhat. You now have fields that don't exist at the base level. The search for tables that match all of the fields in the query becomes more complicated. The end user's interfaces now must know about planning (or budgeting) fields. However, these fields can only be used successfully in the example at the category by region by month level and above. The aggregate navigator must now be able to detect when its search for fields in successively lower-level fact tables fails. For instance, if a user tries to compare planned and actual results at the level of product flavor, the system will deny the request because planning is not performed at the flavor level.

    Aggregates for Everyone

    The aggregate architecture described in this article allows for a very flexible administration. Aggregates are always behind the scenes. Neither the users nor the application developers need ever be aware that specific aggregations have been built. Aggregates can be added or removed by the DBA, even on an hourly basis. A good aggregate navigator should be accompanied by query usage statistics that guide the DBA in building new aggregates. If a group of queries are observed running slowly and all of them are requesting sales summaries by calendar quarter, then the DBA should be alerted to this pattern and begin thinking about building quarterly aggregates. Note that quarterly aggregates would almost never be needed if a "close by" aggregate, such as month aggregates, already exists. There would be little advantage in providing a new level of aggregate that only represented a rollup of a factor of three from an existing aggregate. A quarterly aggregate would make much more sense if the next lower time aggregate was by week, in which case the quarterly aggregate would offer a 13-fold advantage.

    It is very important that the aggregate navigator benefit is available to all clients of the data warehouse. It is not acceptable to have aggregate navigation built into only a single client tool. In such a case, some of the end users would experience the benefits of navigation. Because it is practically impossible to restrict a large corporate environment to using one end-user tool, an embedded navigator approach would be an administrative headache. The worst scenario would involve two or more incompatible navigator schemes, potentially with different aggregate table structures and different metadata. The aggregate navigator must be a corporate network resource and must be a uniform DBMS front end for all query clients.

    This article describes a simple but powerful architecture for aggregate navigation. If DBAs would adhere to the design requirements I've outlined, and if one or more DBMS or middleware vendors would market an aggregate navigator similar to the one I describe, your databases would speed up, your DBAs would spend their time on activities other than fighting with aggregates, and your metadata maintenance responsibilities would drop drastically. This approach requires almost no metadata.

    If the vendors don't provide this benefit, perhaps you should build the navigator yourself. It should take approximately two weeks of competent C programming plus some testing. I built a navigator that does everything described in this article inside my StarTracker query tool in three days of Visual Basic programming. Anyone game?


    Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems, and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. You can reach Ralph through his Internet Web page at http://www.rkimball.com.

    Figure 1


    A Star Join Schema with a central Fact table surrounded by four Dimension tables.

    Figure 2


    The aggregated star join schema derived from Figure 1 by aggregating the product dimension up to the category level.

    Figure 3


    The three way aggregate table formed by aggregating time to month, product to category, and store to region. The italicized planning facts are added because they match this level.


    Table of Contents - August 1996 | Home Page
    Copyright © 1996 Miller Freeman, Inc. ALL RIGHTS RESERVED
    Redistribution without permission is prohibited.
    Please send questions or comments to mfrank@mfi.com
    Updated Monday, August 12, 1996