The extract, transform, and load (ETL) system consumes a disproportionate share of the time and effort required to build a data warehouse and business intelligence (DW/BI) environment. Developing the ETL system is challenging because so many outside constraints put pressure on its design: the business requirements, source data realities, budget, processing windows, and skill sets of the available staff. Yet it can be hard to appreciate why the ETL system is so complex and resource-intensive.
Everyone understands the three letters: you get the data out of its original source location (E), you do something to it (T), and then you load it (L) into a final set of tables for the business users to query. When asked about the best way to design and build the ETL system, many designers say, “Well, that depends.” It depends on the source; it depends on limitations of the data; it depends on the scripting languages and ETL tools available; it depends on the staff’s skills; and it depends on the BI tools. But the “it depends” response is dangerous because it becomes an excuse to take an unstructured approach to developing an ETL system, which in the worst-case scenario results in an undifferentiated spaghetti mess of tables, modules, processes, scripts, triggers, alerts, and job schedules. This “creative” design approach should not be tolerated.
With the wisdom of hindsight from thousands of dimensional data warehouses, a set of ETL best practices has emerged. Careful consideration of these best practices has revealed 34 subsystems that are required in almost every dimensional data warehouse back room. The Kimball Group has organized these 34 subsystems of the ETL architecture into four categories, which we depict graphically in the linked figures:
- Three subsystems focus on extracting data from source systems.
- Five subsystems deal with value-added cleaning and conforming, including dimensional structures to monitor quality errors.
- Thirteen subsystems deliver data as dimensional structures to the final BI layer, such as a subsystem to implement slowly changing dimension techniques.
- Thirteen subsystems help manage the production ETL environment.
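As a quick sanity check on the grouping above, the four categories and their subsystem counts can be written down as a small data structure. This is only an illustrative sketch: the names `SubsystemCategory`, `ETL_CATEGORIES`, and `total_subsystems` are hypothetical and are not part of any Kimball Group tool or API. Only the category descriptions and counts come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubsystemCategory:
    """One group of ETL subsystems (hypothetical helper type)."""
    purpose: str
    subsystem_count: int


# Counts taken directly from the four bullets above.
ETL_CATEGORIES = (
    SubsystemCategory("extracting data from source systems", 3),
    SubsystemCategory("cleaning and conforming", 5),
    SubsystemCategory("delivering dimensional structures to the BI layer", 13),
    SubsystemCategory("managing the production ETL environment", 13),
)


def total_subsystems(categories):
    """Sum the subsystem counts across all categories."""
    return sum(category.subsystem_count for category in categories)


if __name__ == "__main__":
    for category in ETL_CATEGORIES:
        print(f"{category.subsystem_count:>2} subsystems: {category.purpose}")
    print(f"{total_subsystems(ETL_CATEGORIES)} subsystems in total")
```

The four counts add up to the 34 subsystems named in the text, with delivery and management together accounting for 26 of them.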
Full coverage of all of these techniques is available in The Data Warehouse Toolkit, Third Edition.