Insurance is an important and growing sector for the data warehousing market. Several factors have come together in the last year or two to make data warehouses for large insurance companies both possible and extremely necessary. Insurance companies generate several complicated transactions that must be analyzed in many different ways. Until recently, it wasn’t practical to consider storing hundreds of millions — or even billions — of transactions for online access. With the advent of powerful SMP and MPP Unix processors and powerful database query software, these big complicated databases have begun to enter the comfort zone for data warehousing. At the same time, the insurance industry is under incredible pressure to reduce costs. Costs in this business come almost entirely from claims or “losses,” as the insurance industry more accurately describes them.
The design of a big insurance data warehouse must deal with several issues common to all insurance companies. This month, I use InsureCo as a case study to illustrate these issues and show how to resolve them in a data warehouse environment. InsureCo is the pseudonym of a major insurance company that offers automobile, homeowner’s, and personal property insurance to about two million customers. InsureCo has annual revenues of more than $2 billion. My company designed InsureCo’s corporate data warehouse for analyzing all claims across all its lines of business, with history in some cases stretching back more than 15 years.
The first step at InsureCo was to spend two weeks interviewing prospective end users in claims analysis, claims processing, field operations, fraud and security management, finance, and marketing. We talked to more than 50 users, ranging from individual contributors to senior management. From each group of users we elicited descriptions of what they did in a typical day, how they measured the success of what they did, and how they thought they could understand their businesses better. We did not ask them what they wanted in a computerized database. It was our job to design, not theirs.
From these interviews we found three major themes that profoundly affected our design. First, to understand their claims in detail, the users needed to see every possible transaction. This precluded presenting summary data only. Many end-user analyses required the slicing and dicing of the huge pool of transactions.
Second, the users needed to view the business in monthly intervals. Claims needed to be grouped by month, and compared at month’s end to other months of the same year, or to months in previous years. This conflicted with the need to store every transaction, because it was impractical to roll-up complex sequences of transactions just to get monthly premiums and monthly claims payments. Third, we needed to deal with the heterogeneous nature of InsureCo’s lines of business. The facts recorded for an automobile accident claim are different than those recorded for a homeowner’s fire loss claim or for a burglary claim.
These data conflicts arise in many different industries, and are familiar themes for data warehouse designers. The conflict between the detailed transaction view and the monthly snapshot view almost always requires that you build both kinds of tables in the data warehouse. We call these the transaction views and monthly snapshot views of a business. Note that we are not referring to SQL views here, but to physical tables. The need to analyze the entire business across all products (lines of business in InsureCo’s case) versus the need to analyze a specific product with unique measures is called the “heterogeneous products” problem. At InsureCo, we first tackled the transaction and monthly snapshot views of the business by carefully dimensionalizing the base-level claims processing transactions. Every claims processing transaction was able to fit into the star join schema.
This structure is characteristic of transaction-level data warehouse schemas. The central transaction-level fact table consists almost entirely of keys. Transaction fact tables typically have only one additive fact, which we call Amount. The interpretation of the Amount field depends on the transaction type, which is identified in the transaction dimension. The Time dimension is actually two instances of the same dimension table connecting to the fact table to provide independent constraints on the Transaction Date and the Effective Date.
This transaction-level star join schema provided an extremely powerful way for InsureCo to analyze claims. The number of claimants, the timing of claims, the timing of payments made, and the involvement of third parties, such as witnesses and lawyers, were all easily derived from this view of the data. Strangely enough, it was somewhat difficult to derive “claim-to-date” measures, such as monthly snapshots, because of the need to crawl through every detailed transaction from the beginning of history. The solution was to add to InsureCo’s data warehouse a monthly snapshot version of the data. The monthly snapshot removed some of the dimensions, while adding more facts.
The grain of this monthly snapshot fact table was the monthly activity of each claimant’s claim against InsureCo’s insured party. Several of the transaction schema dimensions were suppressed in this monthly snapshot, including Effective Date, Employee, Third Party, and Transaction Type. However, it was important to add a Status dimension to the monthly snapshot so that InsureCo could quickly find all open, closed, and reopened claims. The list of additive, numeric facts was expanded to include several useful measures. These include the amount of the reserve set aside to pay for a claim, amounts paid and received during the month, and an overall count of the transaction activity for this claim. This monthly snapshot schema was extremely useful at InsureCo as a way to rapidly analyze the month-to-month changes in claims and exposure to loss. Monthly snapshot tables were very flexible because interesting summaries could be added as facts, almost at will. Of course, we could never add enough summary buckets to do away with the need for the transaction schema itself. There are hundreds of detailed measures, representing combinations and counts and timings of interesting transactions, all of which would be suppressed if we didn’t preserve the detailed transaction history.
After dispensing with the first big representation problem, we faced the problem of how to deal with heterogeneous products. This problem arose primarily in the monthly snapshot fact table, in which we wanted to store additional monthly summary measures specific to each line of business. These additional measures included automobile coverage, homeowner’s fire coverage, and personal article loss coverage. After talking to the insurance specialists in each line of business, we realized that there were at least 10 custom facts for each line of business. Logically, our fact table design could be extended to include the custom facts for each line of business, but physically we had a disaster on our hands.
Because the custom facts for each line of business were incompatible with each other, for any given monthly snapshot record, most of the fact table was filled with nulls. Only the custom facts for the particular line of business were populated in any given record. The answer was to separate physically the monthly snapshot fact table by coverage type. We ended up with a single core monthly snapshot schema, and a series of custom monthly snapshot schemas, one for each coverage type.
A key element of this design was the repetition of the core facts in each of the custom schemas. This is sometimes hard for a database designer to accept, but it is very important. The core schema is the one InsureCo uses when analyzing the business across different coverage types. Those kinds of analyses use only the core table. InsureCo uses the Automobile Custom schema when analyzing the automobile segment of the business. When performing detailed analyses within the automobile line of business, for example, it is important to avoid linking to the core fact table to get the core measures such as amounts paid and amounts received. In these large databases, it is very dangerous to access more than one fact table at a time. It is far better, in this case, to repeat a little of the data in order to keep the users’ queries confined to single fact tables.
The data warehouse we built at InsureCo is a classic example of a large data warehouse that has to accommodate the conflicting needs for detailed transaction history, high-level monthly summaries, company-wide views, and individual lines of business. We used standard data warehouse design techniques, including transaction views and monthly snapshot views, as well as heterogeneous product schemas to address InsureCo’s needs. This dimensional data warehouse gives the company many interesting ways to view its data.