Ralph Kimball
In my last column (Stirring Things Up, June 22), I described the profound impact
the Web is having on the data warehouse. Every data warehouse must now be made available through Web
browser interfaces. The data warehouse is becoming the infrastructure that supports customer relationship
management (CRM). And the data warehouse is being asked to make the customer clickstream available for
analysis. These forces are changing the way we design and implement the data warehouse. As a signal of
these changes, we have renamed my column Data Webhouse.
In Stirring Things Up, I ended with the thought that the data webhouse has no center, because
it is unavoidably distributed. The data webhouse is not only distributed within individual organizations;
it is also distributed among multiple organizations. Business partners organized in supply chains are
finding the Web a natural medium for information sharing. Again, the data webhouse is often called upon
to be the infrastructure for this sharing.
If the data webhouse is so distributed, how do we manage it? Is there any kind of architectural coherence
to a large data webhouse? Something in the data webhouse must be centralized; its hard to imagine
that it could work otherwise. If, in a real environment, many of the participants in a big data webhouse
use different database technologies, how can we tie the pieces together at run time in order to deliver
query responses and structured reports and analyses?
Basic Architectural Coherence
A big data webhouse spanning multiple organizations and businesses has to have some kind of predictable
uniformity. There must be a set of standards that let the disparate parts of the data webhouse recognize
each other and communicate. New parts of the data webhouse must be able join onto the existing data
webhouse and participate effectively.
Fortunately, this challenge of connecting distributed parts of the data webhouse together is not new. We
can bring data warehousing experience to bear on this issue. For several years, we have understood how to
connect distributed data marts. First, we commit to a standard dimensional framework for the actual data
presentation in all parts of the data webhouse. This framework is based on dimensional modeling, which
has been described extensively in the literature. Sec- ond, we designers sit down at the beginning of
the data webhouse construction and agree on a series of conformed dimensions and conformed facts that
span all the parts of the data webhouse. These conformed dimensions and conformed facts implement what I
call the data warehouse bus architecture. I described this architecture in detail in my October
1998 (Brave New Requirements for Data Warehousing) and November 1998 (Coping with the
Brave New Requirements) columns. The data warehouse bus architecture is the basis for a fully
distributed data warehouse system, such as the data webhouse.
What Is Physically Centralized
We have already identified a major centralized component of the data webhouse, namely the
definitions of the conformed dimensions and the conformed facts. But these definitions are
logical, not physical. For example, in most data webhouses, we will decide on conformed dimensions such
as calendar (the time dimension), customer, and product. When we actually assign responsibility for
constructing the conformed dimensions, however, it is likely that different groups will finally publish
each of these dimensions. I have seen many organizations in which one group performed the maintenance of
the master customer list, and a different group maintained the master product list.
Thus, the responsibility for each of the conformed dimensions should be vested in a group that I call the
dimension authority, which defines, maintains, and publishes a particular dimension to all
the data mart clients that connect to or use the data webhouse. The act of publishing is a kind of
synchronous replication because all the downstream parts of the data webhouse need to have an identical
copy of the dimension at the same time. (See Figure 1.)

FIGURE 1 Five separate groups contribute to a part of a distributed data webhouse. The three
groups on the left publish conformed dimensions. The three groups on the right publish unique fact tables
to the webhouse and use the conformed dimensions.
The replication of the conformed dimensions outward from the dimension authorities to the downstream
client data marts may be the basis for an odd misperception attributed to distributed, noncentralized
data warehouses. The noncentralized data warehouse is accused of unnecessarily duplicating data; it is
only the dimension data that must be duplicated. No matter what your architectural persuasion, a
common structure such as a dimension will always be physically duplicated in a large data warehouse
environment, unless the entire data warehouse sits in a single table space in a single vendors
DBMS. Total physical cen- tralization is not realistic; most companies have many physical table spaces on
many computers. The fact table data, which may compose more than 90 percent of a data warehouse, is
explicitly not duplicated. A strong characteristic of the bus architecture is the isolation of the fact
table data at unique locations. Local fact providers live all across the data webhouse. A local
fact provider is a group that owns and controls a specific source of data. For example, one group may own
the order entry system legacy data, and this group makes the order data available to the data webhouse.
Another group may own the customer communications legacy data. Each local fact provider converts its own
production keys for the calendar, customer, and product dimensions from their system into the keys
defined by the data warehouse bus architecture. Defining the conformed facts is an essential step in
defining the conformed dimensions. A good example of a numeric, additive fact that usually must be
conformed across separate data marts is revenue. If you measure revenue in several data marts, the
technical definition of each of these revenues must be exactly the same so a financial expert would agree
that you could add and divide the separate revenues legitimately. If you cant conform two types of
revenue, then you must label them differently to warn the end user and the application developer not to
combine them as if they were the same measure.
There is no reason to duplicate fact table data anywhere in the data webhouse. These huge tables only
need to have one physical copy. All the disparate end user clients needing the fact data must open
connections directly to the one physical copy maintained by the local fact provider group.
Making a Distributed Application Work
You can make a data webhouse thats based on a distributed set of divergent database technologies
work as a seamless whole; the secret is answer set accretion, which is just a fancy name for
old-fashioned drill across. The query tool or report writer simply opens connections to each of the fact
table providers and retrieves separate answer sets constrained and labeled by the conformed dimensions.
The use of the conformed dimensions guarantees that the separate answer sets can be combined in the
application server above the DBMS, usually with a single pass, sort-merge operation.
Answer set accretion is not the union of separate SQL answer sets. Such a union combines the rows of the
separate answer sets. We want to combine the columns of the answer sets under the headers of each answer
set row, using outer join logic.
Notice that the technique of fetching the answer sets back from the separate fact providers hides the
separate identities of the underlying DBMSs. When the answer sets have arrived at the application server,
it doesnt matter where they came from. An additional bonus is that you can perform the fetching
operations against the different remote fact tables in parallel because they are (probably) on different
machines. From this discussion, it should be obvious why you need to replicate any updates to a conformed
dimension simultaneously to all the affected data marts across the webhouse. If a dimension is
inconsistently represented in different parts of the webhouse, there could be a brief period of time
where a drill across report would return weird, irreproducible results.
Answer set accretion is an exciting topic. I have been speaking with folks at some small startups who
think they can defer the issue of conforming similar dimensions all the way to query time, rather than
sorting it out back in the data staging area. This approach is daring, but I think it might work if the
two dimensions being combined have exactly the same underlying granularity, or if you can define a
mapping among granularities. If this scheme actually bears fruit, you can be sure Ill write about
it.
The Will to Make it Work
Ultimately, a successful, distributed data webhouse hinges on the will to use conformed dimensions and
facts. When the data warehouse bus is in place, the technology to perform answer set accretion is
straightforward. A number of front-end tool providers can do this if the conformed dimensions and facts
exist.
The challenge of defining the conformed dimensions and facts is more organizational and political, rather
than technical. It usually isnt enough just to get all the affected data mart teams together in one
room, although that is a heck of a good start. At some point in such a meeting, a senior authority, such
as a company CIO, needs to affirm the absolute intention to build the common definitions. If the groups
in the room are from different companies in some kind of supply chain, then there must be a combined
commitment to successfully complete the task of conforming.
Ralph Kimball, Ph.D., co-invented the Star Workstation at Xerox and founder of Red Brick
Systems, works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit (Wiley, 1996) and the newly published The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him
through his Web page at rkimball.com.
|