An ever-growing set of data storage technologies offer advantages to the data warehouse architect that look pretty magical. How can you figure out what to do? Should you stick to the tried-and-true relational databases to host your data warehouse? Should you use cubes? Or should you move to the latest and greatest data management solutions like the massively parallel processing (MPP) or columnar solutions? Or to the cloud?
First, let’s talk about the data warehouse database, the main store in which the atomic grain dimensional models are maintained. Everyone who follows the Kimball Method will have one; its design, care, and feeding are the focus of the Kimball Method. You may additionally have downstream cubes or other structures, but for now let’s talk about the atomic data warehouse database.
Most of the organizations we’ve worked with use a RDBMS to host the atomic grain dimensional star schemas for their conformed data warehouses. We observe many SQL Server and Oracle implementations, a smaller number of DB2, and a growing minority of other platforms and technologies. And it’s these alternative technologies that are generating the most questions among our clients.
For most organizations, the plain old RDBMS approach works just fine. You get yourself a server or an internal cloud virtualized environment, install and configure the RDBMS for a data warehouse load, implement and populate your dimensional models, hook up your BI tools, and off you go. The overwhelming majority of sub-terabyte data warehouses use one of the big three RDBMSs to host the atomic dimensional models. These BI systems, if they fail, do not fail because of their choice of database technology.
The landscape is markedly different for very large data warehouse systems. At 10 TB and certainly 100 TB and above, you are far more likely to encounter the alternative technologies.
Alternative Atomic Database Technologies
Generally speaking, there are two main types of database technologies that garner a lot of interest today:
- Massively parallel processing (MPP) solutions which scale out across many small servers rather than scale up one large SMP machine. Available MPP appliances include Teradata, IBM’s Netezza, EMC’s Greenplum, Oracle’s Exadata, and Microsoft’s PDW, among others. We include the open source Hadoop solutions, such as Hive and Impala, in this category of massively distributed deployments.
- Columnar databases, which turn some or all of your tables on their heads by storing data by column rather than by row. Some of the columnar solutions are also massively parallel. The most popular columnar databases include SAP Sybase IQ, InfoBright, HP Vertica, ParAccel (which hosts Amazon’s RedShift), and EXASOL, among others. A data storage structure in Hadoop is the Parquet columnar format, which can be effectively deployed with both Hive and Impala, and provides equivalent benefits to the non-Hadoop columnar database solutions.
Most of the technologies are long out of the experimental stage: Teradata has been around 35 years, SAP Sybase IQ 20+ years, and most of the others from 10 to 15 years.
What do these “New” Technologies Offer?
These technologies offer substantial scalability and query performance improvements for analytic workloads compared to the standard RDBMSs. What they do not offer is magic. There is no magic! You pay for the improved query performance and scalability first through currency: these products are typically more expensive than the corresponding standard RDBMS license plus server cost. You pay for the benefits through a somewhat more complex ETL system. I observe as many people swearing at these technologies as I see swearing by them. Each technology has its own foibles, but updates and deletes seem particularly problematic for many. Nonetheless, if you have a large system—measured both by data volumes and the complexity of usage—you will figure it out and make it work. This becomes just one more in a long string of decisions where we place a burden on ETL in order to deliver a great user experience.
For the very largest data warehouses, these MPP and/or columnar technologies are de rigueur. For the vast majority of small to mid-sized BI environments, they may simply be too expensive in terms of currency, expertise, or system complexity, to be worth spending much time worrying about. Medium to large systems, say 750 GB to 10 TB, have the hardest decision. The factors that would indicate the newer technologies include:
- High data volume
- Rapid planned growth (MPP systems scale more easily)
- Simpler ETL requirements with minimal fact table updates (which tend to be more problematic for these technologies)
- An architecture that uses little or no OLAP or cube technology, and hence has a large and diverse query load directly on the atomic data warehouse database.
What about Cubes?
Architectural decisions can never be made in isolation. Your decision about the atomic data warehouse database technology influences, and is influenced by, your downstream architecture including how you will use cubes. The more you plan to use cubes—whether older OLAP cubes or newer in-memory structures—the less you will need direct ad hoc access to the atomic data warehouse database. In the unusual extreme, where 100% of user access comes through cubes, the DW database is relegated to a purely back room existence.
Your business users often love cubes. They offer a nice user experience for ad hoc querying because they are fundamentally designed for dimensional data. They provide a place to centrally define and share complex calculations. And they usually deliver very good query performance.
As we discussed in Design Tip 162, the atomic relational data warehouse database is the required piece of the architecture. Cubes, much as we love them, are optional. But their inclusion in your architecture does affect your decision about technology for the data warehouse database, as well as some of the physical implementation decisions such as indexing strategies.
And what about the Cloud?
This Design Tip made mention of the cloud in its first paragraph, so it’s fitting that we end there. The tool and database vendors have been making substantial investments in cloud hosting of their tools and services. But among the data warehouse / business intelligence implementations that we work with, we’ve seen little adoption. First, there is the inevitable letdown when people realize that a tool in the cloud is just that: pretty much the same old tool, now in the cloud. No, there’s no magic, you still have to write the ETL.
The earliest adopters are organizations who have already made an investment to hosting their operational systems in the cloud. Assuming reasonable data volumes, using the cloud to host their DW/BI environment is sensible and probably cost effective. Some organizations and industries will not adopt any time soon, because they are forbidden by law or policy to move their data to a cloud environment. But I’m reasonably confident that what today is a trickle will become more of a flood. The pricing models will grow more appealing, and the notion of managing servers and databases in house will seem old fashioned. But I don’t have a magic ball that will tell me when that transition will occur.