There are two powerful ideas at the foundation of most successful data warehouses. First, separate your systems. Second, build stars and cubes.
In my previous column, I described a complete spectrum of design constraints and unavoidable realities facing the data warehouse designer. This was such a daunting list that I worried that you would head for the door. But maybe what kept you reading this far was my promise to dig us out of the morass. This is where the two powerful ideas come in.
Last time, I stated that the nonnegotiable constraints on a data warehouse design were end-user understandability and query execution speed. A complex, slow data warehouse is a failure no matter how elegant the rest of the design may be because the people for whom it was intended won’t want to use it.
All the rest of the constraints and unavoidable realities were pragmatic and honest admissions that the data warehouse design space is extremely complex. The source data, the source systems, the database technologies, and the business environments we must deal with are incredibly complicated. So, as good engineers, to dig our way out of the mire, we must decompose the problem into separate manageable parts and emphasize techniques that are predictable, reusable, and robust when our design environment is altered.
Separate Your Systems
The first, crucial step in designing a data warehouse across a complex enterprise is to separate your systems logically, physically, and administratively.
I’ve found it very useful to think of the project as four distinct and different systems, of which a data warehouse manager should be responsible for only two. I won’t overwhelm you with the typical block diagram because the issues are simpler than that. The four systems are:
- Production (source) transaction processing systems
- Data warehouse staging area systems
- Data warehouse presentation systems, including client/server and Web-based query tools and report writers
- Optional high-end analytic tools supporting data mining, forecasting, scoring, or allocating.
As a data warehouse manager, you shouldn’t be responsible for the source systems that are capturing and processing transactions. That’s someone else’s job. You don’t want to be involved in supporting the legal and financial auditing functions or the rollback and recovery functions of these systems. They are the cash registers of the company, and their priorities are different from those of the data warehouse.
The first system for which the data warehouse is responsible is the data staging area, where production data from many sources is brought in, cleaned, conformed, combined, and ultimately delivered to the data warehouse presentation systems. Much has been written about the crucial extract-transform-load (ETL) steps in the staging area, but stepping away from this detail, the main requirement for the staging area is that it is off limits to all final data warehouse clients. The staging area is exactly like the kitchen in a restaurant. The kitchen is a busy, even dangerous, place filled with sharp knives and hot liquids. The cooks are busy, focused on the task of preparing the food. It just isn’t appropriate to allow diners into a professional kitchen or allow the cooks to be distracted with the very separate issues of the fine dining experience. In data warehouse terms, by barring all data warehouse clients from the data staging area, we avoid:
- Guaranteeing up-time service levels for querying or reporting
- Enforcing client-level security
- Building performance-enhancing indexes and aggregations for query performance
- Handling logical and physical conflicts between the querying and data cleaning steps
- Guaranteeing consistency across separate, asynchronous data sources.
The two dominant data structures in the data staging area are the flat file and the entity/relationship schema, which are directly extracted or derived from the production systems. Almost all processing in the staging area is either sorting or simple sequential processing.
The second major system under the specific control of the data warehouse is the presentation system. Of course, this system is analogous to the dining area of a fine restaurant. The dining area is organized for the comfort of the diners. The food is delivered promptly and in the most appealing way, and hassles and distractions are avoided as much as possible. In the same way, the data warehouse presentation system is purpose-built to enhance the query and reporting experience. The presentation system needs to be simple and fast and present the right data to meet the analysis needs of the end users. Also, in the presentation system, we can easily handle the foregoing bullet list of requirements that we excluded from the staging area.
The dominant data structures in the presentation area are the relational star schema and the online analytic processing (OLAP) data cube. Processing in the presentation area must respond to a blizzard of large and small queries coming in from every possible angle on the data. Over time, there will be no predictable pattern to these queries. Some designers call this the ad hoc attack.
The fourth system on our list is an optional layer of specific high-end analytic tools that often consume data warehouse data in batch quantities. Frequently these data mining, forecasting, scoring, and allocating tools use specialized algorithms outside the normal expertise of the data warehouse designer.
And, honestly, many of these processes have an interpretive or political component that’s wisely segregated from the data warehouse. For example, data mining as a discipline is a complex interpretive task involving a whole collection of powerful analytic techniques, many of which the end-user community doesn’t readily understand or trust. Proper data mining requires a professional data-mining expert who’s equipped to use the tools effectively and represent the data mining results to the community.
In addition, as I’ve often stated, there’s a fundamental impedance mismatch between data mining and the data warehouse. Data mining frequently needs to look at thousands or millions of “observations” over and over again, at extremely high data rates. This just isn’t easy to support from the data warehouse directly. Better to hand the observation set over to the data-mining team, just once.
Another example of a high-end analytic tool the data warehouse should avoid is the allocation system for assigning costs to the various lines of business in your organization so as to compute overall profitability. Not only can this be a complex processing step outside the capabilities of most query and reporting tools, it’s also a political hot potato. Let the finance department do the allocations, and you (the data warehouse) will be glad to store the results.
Symmetrical Stars and Cubes
Most presentation areas today are dominated by relational star schemas and multi-dimensional OLAP data cubes. These data structures have proven over the last 30 years to be the ones end users can understand. Remember that understandability is one of our two nonnegotiable design constraints.
The simplicity of the star schemas and OLAP cubes has allowed smart software designers to focus on very powerful algorithms for rapid query performance. Remember that speed is the other nonnegotiable design constraint.
The symmetry of both the star schema and OLAP cube also makes for:
- Predictable user interfaces that can “learn” what to do at query time
- Predictable administrative scenarios in the staging area because all the data structures have the same familiar look and feel
- Predictable implementation responses whenever new types of data are made available.
Of course, the star schema and OLAP cube are intimately related. Star schemas are most appropriate for very large data sets, with many millions or billions of numerical measurements, or many millions of members in a customer entity or a product entity. OLAP cubes are most appropriate for smaller data sets where analytic tools can perform complex data comparisons and calculations. In almost all OLAP cube environments, it’s recommended that you originally source data into a star schema structure, and then use wizards to transform the data into the OLAP cube. In that way, all the complex staging area ETL tools that deal with flat files and entity/relationship schemas can be part of the OLAP data pipeline. And, of course, hybrid star schema-OLAP systems allow very large data sets in star schema format to be smooth drill-down targets from smaller OLAP data cubes, all under a single user interface.
The Big Payoff
The final big payoff for building the presentation system in the data warehouse around symmetrical star schemas and OLAP cubes is the predictable set of points of commonality for linking together data from across the enterprise.
In my next column, I’ll lay bare the techniques for conforming the definitions of the dimensions and the facts across all the disparate data sources of your large enterprise. These conformed dimensions and facts will be the basis for a data warehouse bus architecture — a set of standard connection points that provide power to your data warehouse, just like the bus bar in a power station provides power to all the transmission lines, and just like the bus in your computer provides data to all the peripherals.
What Have We Accomplished?
So far we’ve implemented two powerful ideas. First we’ve logically, physically, and administratively separated the systems in our environment into four distinct types. You really do need four different computer systems, but you’re responsible for only two of them! Our two data warehouse systems also allow us to separate the incompatible responsibilities of data staging and end-user querying. Second, we’ve populated our data warehouse presentation area with star schemas and OLAP cubes, the only structures that are understandable, and fast, and can stand up to the ad hoc attack.
Although we clearly haven’t addressed all the complex design constraints and unavoidable realities, we’ve chipped away at an impressive part of the overall challenge, just by leveraging these two powerful ideas. Go back to my previous column and look through the list. We’ve effectively addressed large parts of understandability, query speed, all three types of costs mentioned in that column, the risks of inappropriate centralization, the need for incremental development, handling continuous change consisting of little surprises and big surprises, and how to think about the role of data marts. Maybe there’s hope.