|
Commentary: Facts and Fables about Data Warehousing
Facts and Fables are introduced monthly in our Kimball Group Design Tips.
Fable March 9, 2006
Dimensional data warehouses are appropriate for summary level data only.
Fact
This fable is an echo from data warehousing in the early 1990s, when 10 gigabytes was a big number. In those days everyone talked about summarizing data before loading it into a data warehouse. This hasn’t been best practicefor a dimensional data warehouse or any other kindfor more than a decade. Fine grained atomic data has proven to be the most robust data possible. By definition, such data can withstand every possible “ad hoc attack” from users seeking to constrain their queries using highly specific customer and behavior attributes. You should build every data warehouse at the finest grain possible. Recently, even server-based OLAP systems have become robust enough to hold this kind of data. The beauty of exposing all the levels of data in a dimensional format is that you can start the drill-down process at an aggregated level, but then smoothly descend all the way to specific detail using the same BI tool.
Fable February 9, 2006
Ralph Kimball invented the fact and dimension terminology
Fact
While Ralph played a critical role in establishing these terms as industry standards, he didn’t “invent” the concepts. As best as we can determine, the terms facts and dimensions originated from a joint research projected conducted by General Mills and Dartmouth University in the 1960s. By the 1970s, both AC Nielsen and IRI used these terms consistently when describing their syndicated data offerings. Ralph first heard about “dimensions”, “facts” and “conformed dimensions” from AC Nielsen in 1983 as they were explaining their dimensional structures for simplifying the presentation of analytic information.
Fable October 7, 2005
Normalizing data is a prerequisite for data integration.
Fact
Normalization does NOT deliver integration. Integration requires organizational agreement on matching rules, domain values, and standard labels. Reaching agreement or conformance is the tough part of data integration; normalization is merely a structure for storing the agreed upon results. At best, normalization slightly aids integration as the analyst is probably researching each data element in preparation for the actual conforming step.
Fable September 2, 2005
Instead of deploying Kimball’s conformed dimensions which require people to use the same, consistent names for data, a metadata repository can used to equate data elements having different names in different sources.
Fact
When faced with data integration challenges, some designers believe that a simple intermediate data structure is all that’s needed to “perform translation on the fly”. Unfortunately, true data integration supporting integrated “drill across” reports can only succeed if the textual descriptors (fields) in each separate source are physically altered so they have the same label (column name) and content (data domain values).
Fable August 3, 2005
Dimensional models are fully denormalized.
Fact
Dimensional models combine normalized and denormalized table structures. The dimension tables of descriptive information are highly denormalized with detailed and hierarchical roll-up attributes in the same table. Meanwhile, the fact tables with performance metrics are typically normalized. While we advise against a fully normalized with snowflaked dimension attributes in separate tables (creating blizzard-like conditions for the business user), a single denormalized big wide table containing both metrics and descriptions in the same table is also ill-advised.
Fable July 5, 2005
Data mining and statistical methods can not be utilized if data is structured in dimensional models.
Fact
Data mining is most effective when presented with a rich set of transactional data. Virtually all data mining technologies expect to read data into their environment from a flat observation table. They make no assumptions regarding how the data was structured before loading into the tool. An atomic, transaction-grained dimensional model is an excellent source of data for data mining and statistical analysis.
This fable is an offshoot from the erroneous assumption that dimensional models presuppose the business questions and therefore ignore detailed atomic data. Instead, dimensional models should be built with the most robust and expressive atomic data.
Fable June 3, 2005
Dimensional data is organized differently from relational data.
Fact
This is like saying a Ford and a car are different. The fable results from confusion between the terms normalized and relational. Normalization is a modeling approach to support high volume transactions in a relational database environment. It removes redundancy to process transactions quickly. A dimensional model is designed to support analytical queries and user access. These queries typically involve selecting and aggregating arbitrary subsets of data; rarely do they involve inserts or updates. Normalized and dimensional models are simply different design approaches to solve different problems. Both can be implemented in a relational database.
Fable May 4, 2005
Dimensional data warehouses are appropriate for summary level data only.
Fact
Back in the old days, when 10 gigabytes was a big number, everyone talked about summarizing data before loading it into a data warehouse. This hasnt been best practicefor a dimensional data warehouse or any other kindfor over a decade. Build your system at the finest grain possible: at the level of a transaction, or even sub-transactional if you have a business need. (Examples of sub-transactional include users navigation through a website, or call routing through a telecommunications network.) Over the past five years, even server-based OLAP systems have become robust enough to hold transaction-level data. If your system is on the bleeding edge of data volumes, its as important as ever to use a dimensional model for best query performance and usability.
Fable April 11, 2005
Attributes such as employee age or gender should be treated as degenerate dimensions in the fact table rather than as employee dimension attributes.
Fact
Dont allow textual attributes to clutter fact tables under the guise of degenerate dimensions. Degenerate dimensions are typically reserved for operational control numbers such as invoice, purchase order, or check payment numbers. Youll encounter them in transactional fact tables where they are dimensional keys, but dont join to actual dimension tables.
Fable March 6, 2005
A good way to narrow the scope and control the risk of data warehouse development is to focus on delivering the single report most requested by end users
Fact
Starting with a specific report is a terrible way to build a data warehouse. Data warehouse development risk is concentrated almost entirely in sourcing and transforming the needed data. A high profile report, such as customer profitability or customer satisfaction, may require a dozen different data sources. The users expectations for this report are likely unrealistic; they assume the data warehouse will erase all their existing problems. It is far better to roll out a succession of data marts each based on individual sources of data and use the Data Warehouse Bus Architecture to gradually provide the components of desired high profile reports over time.
Fable February 8, 2005
DM Reviews Ask the Experts says: In a bus architecture, there would be no persistent intermediate database; all ETL processing would be handled by the ETL programs without "parking" the data between source and target.
Fact
The data warehouse bus architecture is a specific structure for integrating data from a variety of sources by using conformed dimensions. Data warehouses based on the bus architecture would routinely stage the data after extraction, cleaning, and final transformation.
Fable January 11, 2005
Dimensional data warehouses are passé. You can cost-effectively substitute the power of a database platform for the design and transformation work required to build dimensional data models.
Fact
Any data warehouse worthy of the name has clean data that fully tracks history. We recommend dimensional models for two reasons: they are easy for business users to understand and navigate, and they are efficient to query. If you have enough computing horsepower, why not use views or some other logical layer to present a dimensional view to users, but keep data in a normalized format thats similar to the source systems and hence easy to maintain? Feasible: probably. Cost-effective: no. First, why waste server cycles at query time? Its more efficient to perform the restructuring work once when the data are loaded. Second, youre not simplifying the overall system. Instead, youre shifting the work from ETL system developers to the front end guys, who need to figure out how to make a normalized schema look like a dimensional model that users can understand. Its the same work either way.
Fable December 7, 2004
Data should be "application neutral" in the data warehouse, meaning the data model should not be built for a specific BI application, such as product profitability.
Fact
As we've said previously, providing atomic data in a dimensional data warehouse delivers maximum flexibility because it doesn't pre-suppose the business question. Product profitability, however, is in a different league. You can't expect business analysts to transform raw atomic data into profitability metrics by performing cost allocations on the fly at query time. For complex, cross-process applications like profitability analysis, the data should be structured in a procedural subsystem of the ETL back room, allowing the business users to execute simpler queries with consistent results.
Fable October 28, 2004
The primary key of a fact table consists of all the referenced dimension foreign keys.
Fact
A fact table often has 10 or more foreign keys joining to the dimension tables primary keys. However, only a subset of the fact tables foreign key references is typically needed for row uniqueness. Most fact tables have a primary key that consists of a concatenated/composite subset of the foreign keys.
Fable October 8, 2004
Since dimensional models are built with a singular focus on a specific group of users or requirements, new dimensional models must be built to accommodate new or additional business requirements.
Fact
This misconception stems from erroneously combining two of our earlier fables: dimensional models are built by business department and dimensional models contain only summarized data. Building dimensional models to support business processes with atomic detail will result in an environment that can respond to a wide variety of requirements. While you must create new fact tables when incorporating new business processes into the data warehouse environment, it is not necessary to build new schema to present the same data to different users or reporting requirements
Fable September 14, 2004
Dimensional warehousing is best suited for no more than two unique source systems because of the complexity of integrating data on the same subject matter from multiple sources.
Fact
The challenges of data integration have nothing to do with the modeling approach. Data integration requires identifying incompatible labels and measures used across an organization, and then reaching a common consensus on how to administer them in a centralized way. Paradoxically, dimensional modeling reveals the labels and measures of a business so clearly that an organization has no choice but address the integration problem directly!
Fable August 26, 2004
Data marts are departmental (labeled with the vocabulary most familiar to each department).
Fact
Atomic data marts should be dimensionally structured based on business processes (such as orders, shipments and payments), not organizational business departments. Each core business process captures/generates unique performance metrics with unique granularity. Common dimensions (e.g., product and customer) are reused across the process-centric marts. If marts are created on a departmental basis for Finance, Marketing and Sales, then the same atomic metrics are replicated repeatedly for each department. Whats the likelihood that the metrics are consistently defined, labeled and populated in the departmental data stores? A departmental approach is highly vulnerable to inconsistent, non-integrated point solutions. Process-centric marts deliver a single version of the truth.
Fable August 2, 2004
Bringing a new data source into a dimensional data warehouse breaks the existing schemas and requires creating new fact tables or data marts.
Fact
If the new data source presents data at the same grain (level of detail) as an existing fact table, then the new data source can be gracefully added to that fact table without altering any existing applications. This is one of the great strengths of dimensional modeling since there are a set of documented and well defined graceful modifications that have this characteristic. If the new data source is at a different grain then a new fact table must be created, but this has nothing to do with the modeling approach. All data representations must create a new entity when a new table with different keys is introduced.
Fable July 12, 2004
A changing dimension (as in SCDs) is only a problem when the model is built on dimensions.
Fact
Time variance is a fundamental issue that must be dealt with in any data warehouse. When the description of a fundamental entity like customer or product changes, the data warehouse must have a systematic approach for recording the change.
Dimensional modeling deals with time variance with the standard design technique known as Slowly Changing Dimensions (SCDs). When normalized models step up to the issue of time variance, they typically add time stamps to the entities in various configurations. These time stamps serve to capture every entity change (just like a type 2 SCD does), but without using a surrogate key for each new row, the query interface must issue a double-barreled join that constrains BOTH the natural key and the time stamp between every pair of tables that must be joined. Not very business user friendly, is it?
Fable June 9, 2004
Normalizing your data across the enterprise automatically solves the data integration problem.
Fact
Normalizing data contributes nothing to integration, except forcing the data analyst to confront all the inconsistencies across various data sources. Data integration is a process outside of the specific modeling approach that establishes common enterprise-wide standard labels and measures. These standard labels and measures can then be represented in either a normalized ERD or a star schema. In the star schema world these labels and measures reside in conformed dimensions and conformed facts, respectively.
Fable May 11, 2004
Dimensional models presuppose the business question and are therefore inflexible.
Fact
Dimensional models with ATOMIC data are independent of the business question and are therefore the most flexible and symmetrical framework for presenting business data.
|