Declaring the Grain - Kimball Group

By Ralph Kimball
March 1, 2003

In debugging literally thousands of dimensional designs from my students over the years, I have found that the most frequent design error by far is not declaring the grain of the fact table at the beginning of the design process. If the grain isn’t clearly defined, the whole design rests on quicksand. Discussions about candidate dimensions go around in circles, and rogue facts that introduce application errors sneak into the design.

Declaring the grain means saying exactly what a fact table record represents. Remember that a fact table record captures a measurement. Example declarations of the grain include:

An individual line item on a customer’s retail sales ticket as measured by a scanner device
An individual transaction against an insurance policy
A line item on a bill received from a doctor
A boarding pass used by someone on an airplane flight
An inventory measurement taken every week for every product in every store.

In Business Terms

Notice that most of these grain declarations are expressed in business terms. Perhaps you were expecting the grain to be a traditional declaration of the fact table’s primary key. Although the grain ultimately is equivalent to the primary key, it’s a mistake to list a set of dimensions and then assume that this list is the grain declaration. This is the most common mistake in my students’ designs. In a properly executed dimensional design, the grain is first anchored to a clear business object (no pun intended) and a set of business rules. Then, the dimensions that implement that grain become obvious.

So, when you make a grain declaration, you can then have a very precise discussion of which dimensions are possible and which are not. For example, a line item of a doctor’s bill likely would have the following dimensions:

Date (of treatment)
Doctor (may be called “provider”)
Patient
Procedure
Primary Diagnosis
Location (presumably the doctor’s office)
Billing Organization (an organization the doctor belongs to)
Responsible Party (either the patient or the patient’s legal guardian)
Primary Payer (often an insurance plan)
Secondary Payer (maybe the responsible party’s spouse’s insurance plan).

And quite possibly others.

Powerful Effects

If you’ve been following this example, I hope you’ve noticed some powerful effects from declaring the grain. First, you can visualize the dimensionality of the doctor bill line item very precisely, and you can therefore confidently examine your data sources, deciding whether or not a dimension can be attached to this data. For example, you probably would exclude “treatment outcome” from this example because most medical billing data doesn’t tie to any notion of outcome.

But a general entity/relation-oriented “data model” of doctor visits might well include treatment outcome. After all, in an abstract sense, doesn’t every treatment have an outcome?

The discipline of insisting on the grain declaration at the beginning of a dimensional design keeps you from making this kind of mistake. A model of billable doctor visits that included treatment outcome would look like a dimensional model but it wouldn’t be implementable. This is my main gripe with many of the current offerings of “standard schemas” in books and CDs. Because they have no grain discipline, they often combine entities that don’t coexist in real data sources. Every fact table design must be rooted in the realities of available physical data sources.

A second major insight from the doctor bill line item grain declaration is that this very atomic grain gives rise to many dimensions! I listed 10 dimensions, and experts in health care billing probably know of a few more. It’s an interesting realization that the smaller and more atomic the measurement (fact table record), the more things you know for sure, and the more dimensions you have. This phenomenon is another way of explaining why atomic data resists the ad hoc attack by end users. Atomic data has the most dimensionality and so it can be constrained and rolled up in every way possible for that data source. Atomic data is a perfect match for the dimensional approach.

In the doctor bill line item example with 10 dimensions, you wouldn’t expect the primary key of the fact table to consist of all 10 dimensional foreign keys. Logically from a business rules perspective, perhaps we know that the combination of date, doctor, patient, and procedure are enough to guarantee a unique record. So these fields could implement the fact table primary key. Or, it is possible that we have extra “degenerate” keys for the patient bill number and the line item number that would, by themselves, implement an acceptable physical fact table key. But we are confident that we can add these degenerate dimensions to the design because they are consistent with our grain declaration. The grain declaration is like a contract!

The grain declaration lets us think creatively about adding dimensions to a fact table design that may not obviously be present in the source data. In retail sales data, marketplace causal factors like promotions and competitive effects may be very important to understanding the data, but this information may not be present in a literal sense in the data extract. The grain definition (see the first example in our list of grains) tells us that we can indeed add a causal “store condition” dimension to the fact table as long the store condition descriptions vary appropriately by time, product, and store location. A weather dimension can be added to many fact table designs using the same logic. Once such a new dimension is identified, it is incumbent on the data warehouse designer to find the appropriate store condition or weather data source and insert it into the backroom data staging applications that build the fact tables.

All of the grain declarations listed in this article represent the lowest possible granularity of their respective data sources. These data measurements are “atomic” and cannot be divided further. But it is quite possible to declare higher level grains for each of these data sources that represent aggregations of atomic data:

All the sales for a product in a store on a day
Insurance policy transaction totals by month by line of business
Charged amount totals by treatment by diagnosis by month
Counts of passengers and other flight customer-satisfaction issues by route by month
Average inventory levels by quarter by region.

These higher levels of aggregation will usually have fewer, smaller dimensions. The doctor example might end up with only the dimensions:

Month
Doctor
Procedure
Diagnosis.

It would be nonsensical in an aggregated fact table to try to include all the original dimensions of the atomic data, because you would find yourself recapitulating the atomic level of the data!

Useful aggregations necessarily shrink dimensions and remove dimensions; therefore, aggregated data always needs to be used in conjunction with its base atomic data because aggregated data has less dimensional detail. Some developers get confused on this point, and after declaring that data marts necessarily consist of aggregated data, they criticize the data marts for “anticipating the business question.” All these misunderstandings disappear when aggregated data is made available along with the atomic data from which it’s derived.

Keep Facts True to the Grain

The most important result of declaring the grain of the fact table is anchoring the discussion of the dimensions. But declaring the grain lets you be equally clear about the measured numeric facts. Simply put, the facts must be true to the grain. In the doctor example, the most obvious measured fact would be “billed amount,” relating to the specific line item.

Other facts relating to the treatment that patient received at that time may also exist. But helpful facts, such as the amount billed year-to-date to this patient for all treatments, aren’t true to the grain. When a reporting application combines fact records arbitrarily, these untrue-to-the-grain facts produce nonsensical, useless results. Viewed in this way, these facts are dangerous because they invite the end user to make mistakes. Omit them from the design. Calculate such aggregate measures in your application.

Brief Recap

In summary, try to do your dimensional designs using the following four steps, in order:

1. Decide on your sources of data.

2. Declare the grain of the fact table (preferably at the most atomic level).

3. Add dimensions for “everything you know” about this grain.

4. Add numeric measured facts true to the grain.

About the Author: Ralph Kimball

Ralph Kimball is the founder of the Kimball Group and Kimball University where he has taught data warehouse design to more than 10,000 students. He is known for the best selling series of Toolkit books. He started with a Ph.D. in man-machine systems from Stanford in 1973 and has spent nearly four decades designing systems for users that are simple and fast.