Intelligent Enterprise Magazine - Data Webhouse Intelligent Enterprise Subscribe Article Index Contacts Resources Write the Editor



 

July 13, 1999, Volume 2 - Number 10

Using our data warehouse experience we can find

ways to manage our highly distributed webhouses

The Data Webhouse Has No Center


Ralph Kimball                


In my last column (“Stirring Things Up,” June 22), I described the profound impact the Web is having on the data warehouse. Every data warehouse must now be made available through Web browser interfaces. The data warehouse is becoming the infrastructure that supports customer relationship management (CRM). And the data warehouse is being asked to make the customer clickstream available for analysis. These forces are changing the way we design and implement the data warehouse. As a signal of these changes, we have renamed my column “Data Webhouse.”

In “Stirring Things Up,” I ended with the thought that the data webhouse has no center, because it is unavoidably distributed. The data webhouse is not only distributed within individual organizations; it is also distributed among multiple organizations. Business partners organized in supply chains are finding the Web a natural medium for information sharing. Again, the data webhouse is often called upon to be the infrastructure for this sharing.

If the data webhouse is so distributed, how do we manage it? Is there any kind of architectural coherence to a large data webhouse? Something in the data webhouse must be centralized; it’s hard to imagine that it could work otherwise. If, in a real environment, many of the participants in a big data webhouse use different database technologies, how can we tie the pieces together at run time in order to deliver query responses and structured reports and analyses?

Basic Architectural Coherence

A big data webhouse spanning multiple organizations and businesses has to have some kind of predictable uniformity. There must be a set of standards that let the disparate parts of the data webhouse recognize each other and communicate. New parts of the data webhouse must be able join onto the existing data webhouse and participate effectively.

Fortunately, this challenge of connecting distributed parts of the data webhouse together is not new. We can bring data warehousing experience to bear on this issue. For several years, we have understood how to connect distributed data marts. First, we commit to a standard dimensional framework for the actual data presentation in all parts of the data webhouse. This framework is based on dimensional modeling, which has been described extensively in the literature. Sec- ond, we designers sit down at the beginning of the data webhouse construction and agree on a series of conformed dimensions and conformed facts that span all the parts of the data webhouse. These conformed dimensions and conformed facts implement what I call the data warehouse bus architecture. I described this architecture in detail in my October 1998 (“Brave New Requirements for Data Warehousing”) and November 1998 (“Coping with the Brave New Requirements”) columns. The data warehouse bus architecture is the basis for a fully distributed data warehouse system, such as the data webhouse.

What Is Physically Centralized

We have already identified a major centralized component of the data webhouse, namely the definitions of the conformed dimensions and the conformed facts. But these definitions are logical, not physical. For example, in most data webhouses, we will decide on conformed dimensions such as calendar (the time dimension), customer, and product. When we actually assign responsibility for constructing the conformed dimensions, however, it is likely that different groups will finally publish each of these dimensions. I have seen many organizations in which one group performed the maintenance of the master customer list, and a different group maintained the master product list.

Thus, the responsibility for each of the conformed dimensions should be vested in a group that I call the dimension authority, which defines, maintains, and publishes a particular dimension to all the data mart clients that connect to or use the data webhouse. The act of publishing is a kind of synchronous replication because all the downstream parts of the data webhouse need to have an identical copy of the dimension at the same time. (See Figure 1.)



FIGURE 1 Five separate groups contribute to a part of a distributed data webhouse. The three groups on the left publish conformed dimensions. The three groups on the right publish unique fact tables to the webhouse and use the conformed dimensions.

The replication of the conformed dimensions outward from the dimension authorities to the downstream client data marts may be the basis for an odd misperception attributed to distributed, noncentralized data warehouses. The noncentralized data warehouse is accused of unnecessarily duplicating data; it is only the dimension data that must be duplicated. No matter what your architectural persuasion, a common structure such as a dimension will always be physically duplicated in a large data warehouse environment, unless the entire data warehouse sits in a single table space in a single vendor’s DBMS. Total physical cen- tralization is not realistic; most companies have many physical table spaces on many computers.

The fact table data, which may compose more than 90 percent of a data warehouse, is explicitly not duplicated. A strong characteristic of the bus architecture is the isolation of the fact table data at unique locations. Local fact providers live all across the data webhouse. A local fact provider is a group that owns and controls a specific source of data. For example, one group may own the order entry system legacy data, and this group makes the order data available to the data webhouse. Another group may own the customer communications legacy data. Each local fact provider converts its own production keys for the calendar, customer, and product dimensions from their system into the keys defined by the data warehouse bus architecture.

Defining the conformed facts is an essential step in defining the conformed dimensions. A good example of a numeric, additive fact that usually must be conformed across separate data marts is revenue. If you measure revenue in several data marts, the technical definition of each of these revenues must be exactly the same so a financial expert would agree that you could add and divide the separate revenues legitimately. If you can’t conform two types of revenue, then you must label them differently to warn the end user and the application developer not to combine them as if they were the same measure.

There is no reason to duplicate fact table data anywhere in the data webhouse. These huge tables only need to have one physical copy. All the disparate end user clients needing the fact data must open connections directly to the one physical copy maintained by the local fact provider group.

Making a Distributed Application Work

You can make a data webhouse that’s based on a distributed set of divergent database technologies work as a seamless whole; the secret is “answer set accretion,” which is just a fancy name for old-fashioned drill across. The query tool or report writer simply opens connections to each of the fact table providers and retrieves separate answer sets constrained and labeled by the conformed dimensions. The use of the conformed dimensions guarantees that the separate answer sets can be combined in the application server above the DBMS, usually with a single pass, sort-merge operation.

Answer set accretion is not the union of separate SQL answer sets. Such a union combines the rows of the separate answer sets. We want to combine the columns of the answer sets under the headers of each answer set row, using outer join logic.

Notice that the technique of fetching the answer sets back from the separate fact providers hides the separate identities of the underlying DBMSs. When the answer sets have arrived at the application server, it doesn’t matter where they came from. An additional bonus is that you can perform the fetching operations against the different remote fact tables in parallel because they are (probably) on different machines. From this discussion, it should be obvious why you need to replicate any updates to a conformed dimension simultaneously to all the affected data marts across the webhouse. If a dimension is inconsistently represented in different parts of the webhouse, there could be a brief period of time where a drill across report would return weird, irreproducible results.

Answer set accretion is an exciting topic. I have been speaking with folks at some small startups who think they can defer the issue of conforming similar dimensions all the way to query time, rather than sorting it out back in the data staging area. This approach is daring, but I think it might work if the two dimensions being combined have exactly the same underlying granularity, or if you can define a mapping among granularities. If this scheme actually bears fruit, you can be sure I’ll write about it.

The Will to Make it Work

Ultimately, a successful, distributed data webhouse hinges on the will to use conformed dimensions and facts. When the data warehouse bus is in place, the technology to perform answer set accretion is straightforward. A number of front-end tool providers can do this if the conformed dimensions and facts exist.

The challenge of defining the conformed dimensions and facts is more organizational and political, rather than technical. It usually isn’t enough just to get all the affected data mart teams together in one room, although that is a heck of a good start. At some point in such a meeting, a senior authority, such as a company CIO, needs to affirm the absolute intention to build the common definitions. If the groups in the room are from different companies in some kind of supply chain, then there must be a combined commitment to successfully complete the task of conforming.



Ralph Kimball, Ph.D., co-invented the Star Workstation at Xerox and founder of Red Brick Systems, works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit (Wiley, 1996) and the newly published The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him through his Web page at rkimball.com.

 

 

Copyright © 2004 CMP Media Inc. ALL RIGHTS RESERVED
No Reproduction without permission
Intelligent Enterprise... subscribe... archives... media kit... resources