Drilling across separate business processes is one of the most powerful applications in a data warehouse. We often describe drilling across as magic: separately open connections to the dimensional models for each business process, fetch answer sets from each process labeled identically with row headers drawn from specially conformed dimensions, then deliver the result by sort-merging these answer sets over the row headers. It is magic because the original data for each business process can be profoundly distributed on different machines, perhaps even running different styles of DBMSs.
In our classes we introduce drilling across with the following simple example. Suppose we have three processes: manufacturing shipments, warehouse inventory, and retail sales. We have carefully administered the product dimension associated with each of these processes so the product descriptions are exactly consistent. We call this a conformed dimension. In our simple example, there are three products named Framis, Toggle, and Widget.
When we query each process separately, we get the following three answer sets:
The challenge of drilling across is to take these three answer sets and sort merge them on the Product row header, hopefully getting the following result:
Seems simple, doesn’t it? But in real life, if the three business processes are on different machines, where do you perform the drill across? Cognos, Business Objects, and the server layer of OBIEE all can perform their own versions of drilling across if you can find the feature under a different name! But how do you open arbitrary connections to random remote data sources and then perform this sort-merge, especially if you don’t have one of these tools?
In the past, I have told students to study their BI applications and “figure out” what to do. This is unsatisfactory, and I suspect many students have never actually implemented drill-across applications.
I decided to try implementing drill across in Excel, where within a single master spreadsheet, each process’s results would be stored in a separate worksheet. I hoped that I could just use the Pivot command to accomplish the sort-merge, but Pivot is not nearly powerful enough. The sort-merge across many simultaneous data sources is actually a sophisticated series of steps, which in relational parlance is a highly parallelized full symmetric outer join across all the sources. But in general we can’t use an RDBMS because we are pulling from multiple diverse sources, perhaps including OLAP systems or Hive tools in Hadoop environments. Actually, any results set that has the requisite format is fair game for drilling across.
Fortunately I am very comfortable with Visual Basic so I wrote an Excel macro in VB to do drilling across. Once you store the various results into separate worksheets, you run the macro and Presto!
The code for the macro is available at this link. You can download the complete spreadsheet with sample data and the macro at this link. In the code, worksheets for each process are called Process1, Process2, and so forth. Open the worksheet named StartHere to begin. The worksheet for the final result is called DrillAcrossResult.
Coding and debugging this macro took me about six hours over a two day period. I have made it fairly robust. You can drill across 20 processes, where each process can have hundreds or even thousands of rows. The results sets can be labeled with multi-level row headers, and each process can have up to 20 unique associated facts. Because it implements an outer join of the results sets, the final answer is allowed to have blank cells.
Each process is fully sorted by the macro before the drill-across operation. Thus you can even manually add new rows anywhere without requiring that each process be sorted.