Predictive analytics is the name for a broad range of analysis techniques used for making predictions about future behavior. Credit scoring, risk analysis, and promotion selection are among the many applications that have proven to drive revenue and profit. It is worth taking a look at the “predictive analytics” article on Wikipedia to appreciate the broad scope of these techniques.
In spite of the significant differences among predictive analytic techniques, almost all of them can ingest data as a series of “observations,” usually keyed by a familiar data warehouse key such as customer. For the sake of discussion, let us assume that we have 10 million customers and we wish to predict future behavior by running a series of predictive analytic applications against these customers’ histories.
The desired input data to the predictive analytic application is a 10 million row table whose primary key is the durable and unique customer key, and whose columns are an open-ended number of descriptive attributes, including the usual descriptive and demographic fields, but also including indicators, counts, and amounts purposely prepared for the particular run of the predictive analytic application. It is this last requirement, preparing attributes for a particular run, that makes all this so interesting.
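To make this concrete, here is a minimal sketch of what such an observation table might look like. All table and column names are hypothetical, chosen only to illustrate the mix of stable descriptive attributes and run-specific prepared attributes:

```sql
-- Hypothetical observation table: one row per customer, keyed by the
-- durable customer key, mixing stable descriptive attributes with
-- indicators, counts, and amounts prepared for one analytic run.
CREATE TABLE customer_observation (
    customer_key        BIGINT PRIMARY KEY,  -- durable, unique customer key
    -- conventional descriptive and demographic fields
    customer_name       VARCHAR(100),
    city                VARCHAR(50),
    income_band         VARCHAR(20),
    -- attributes purposely prepared for this particular run
    returned_item_flag  CHAR(1),             -- indicator derived from returns facts
    visits_last_90_days INTEGER,             -- count derived from visit facts
    total_spend_12_mo   DECIMAL(12,2),       -- amount derived from sales facts
    churn_risk_score    DECIMAL(5,4)         -- score computed at analysis time
);
```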
Our input data to the predictive analytics model looks a lot like a very wide table, but it is much more volatile and complex than a normal, fairly stable dimension table. Some of the attributes already exist in a normal customer dimension, but many of the interesting attributes are chosen at analysis time by searching fact tables describing customer history and then summarizing that history with a tag, score, or amount. Some predictive analytic applications may want to update the input data in real time, and the analyst may wish to dynamically add new calculated attributes to the input data at any time. Finally, a given snapshot of this input data may well be a useful customer dimension in its own right.
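Here is one hedged sketch of what “summarizing history with a tag, score, or amount” could look like at analysis time. The sales_fact and date_dim tables, and the tag thresholds, are purely illustrative assumptions:

```sql
-- Hypothetical example: summarize each customer's 12 months of purchase
-- history into an amount and a simple behavior tag at analysis time.
SELECT f.customer_key,
       SUM(f.sale_amount) AS total_spend_12_mo,
       CASE
         WHEN SUM(f.sale_amount) >= 10000 THEN 'HIGH_VALUE'
         WHEN SUM(f.sale_amount) >=  1000 THEN 'REGULAR'
         ELSE 'OCCASIONAL'
       END AS spend_behavior_tag
FROM sales_fact f
JOIN date_dim d ON f.date_key = d.date_key
WHERE d.full_date >= CURRENT_DATE - INTERVAL '12' MONTH
GROUP BY f.customer_key;
```

The result set, one row per customer, can then be merged into the observation table alongside the conventional dimension attributes.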
If these customer tags, scores, and amounts are stable and regularly used for query filtering and grouping beyond predictive analysis, then they could become permanent members of the customer dimension. Two specific design patterns, aggregated facts as dimension attributes and behavior tag time series, are described briefly in the glossary on our website and illustrated in Chapter 8 of The Data Warehouse Toolkit, Third Edition.
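As a rough sketch of the behavior tag time series pattern, and with hypothetical column names and tag values, the customer dimension might carry one tag per recent period so that queries can filter and group on trajectories:

```sql
-- Hypothetical sketch: one behavior tag per recent quarter, plus the
-- concatenated series, added to an existing customer dimension.
ALTER TABLE customer_dimension
  ADD COLUMN tag_2015_q1 CHAR(1),    -- e.g., A/B/C activity cluster for the quarter
  ADD COLUMN tag_2015_q2 CHAR(1),
  ADD COLUMN tag_2015_q3 CHAR(1),
  ADD COLUMN tag_2015_q4 CHAR(1),
  ADD COLUMN tag_series  VARCHAR(8); -- concatenated series, e.g., 'ABBC'
```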
The observation data that’s fed into a predictive analysis application from a dimensional model is pretty simple. But how do we populate the table? Is it done in ETL with a complex backroom application that writes this final table in a conventional format? Is it done entirely in SQL at run time, where each attribute of each row is populated by a correlated SELECT statement accessing remote fact tables? Can the data for the observation attributes be stored in an actual dimension, rather than in separate fact tables, with the multiple values held as an SQL ARRAY of STRUCTs, even as a time series? Are there tools in the Hadoop environment, such as Spark or HBase, that provide a more natural and efficient way to build the observation dataset that’s handed off for predictive analytics?
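To give a flavor of the ARRAY of STRUCTs option, here is a minimal sketch in Spark SQL / Hive dialect, with hypothetical table, column, and field names. It stores each customer’s observation history inside the dimension itself and flattens it at query time:

```sql
-- Hypothetical sketch: the dimension carries a time series of
-- observations as an array of structs, instead of separate fact tables.
CREATE TABLE customer_dimension (
    customer_key   BIGINT,
    customer_name  STRING,
    observations   ARRAY<STRUCT<obs_date: DATE,
                                tag:      STRING,
                                score:    DOUBLE,
                                amount:   DECIMAL(12,2)>>
);

-- Flatten the array at query time to build the observation dataset
-- handed off to the predictive analytics application.
SELECT customer_key, obs.obs_date, obs.tag, obs.score, obs.amount
FROM customer_dimension
LATERAL VIEW explode(observations) o AS obs;
```

Whether this is more natural or efficient than the ETL or correlated-SELECT approaches is exactly the kind of open question posed above.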
Part of my role in this retirement year for the Kimball Group is to describe the enduring relevance of our well-proven dimensional modeling techniques, and at the same time to challenge you readers in the data warehouse community to think about new directions for data warehousing in the future. To be honest, I don’t have all the answers to the above questions, but between now and the end of the year I will at least describe the challenges and my thinking on these approaches as best I can.