Design Tip #150 Best Practices for Big Data
November 5, 2012
© Kimball Group. All rights reserved.
The big data revolution is well under way. We know that the sheer bigness of the data is not what is interesting. Rather, big data departs severely from the familiar text and number data that we have stored in relational databases and analyzed with SQL for more than 20 years. The format and content of big data range from unstructured free text to highly structured vectors, matrices, images, and collections of name-value pairs. Big data generally needs procedural languages and the ability to program arbitrary new logic. Big data is best exploited by combing through huge unfiltered datasets that combine historical and real-time data.
The big data marketplace is far from mature, but we now have several years of accumulated experience with a number of best practices specific to big data. In this Design Tip I describe very briefly 25 best practices that have emerged from the big data community, some of which are direct extensions of familiar EDW best practices, while others are completely new.
Management Best Practices for Big Data
1. Structure big data environments around analytics, not ad hoc querying or standard reporting.
2. Do not attempt to build a legacy big data environment at this time. Rather, plan for disruptive changes coming from every direction: new data types, competitive challenges, programming approaches, hardware, networking technology, and services offered by literally hundreds of new big data providers.
3. Embrace sandboxes and build a practice of productionizing sandbox results. Allow data scientists to construct their data experiments and prototypes using their preferred languages and programming environments. Then, after proof of concept, systematically reprogram and/or reconfigure these implementations with an “IT turnover team.”
4. Put your toe in the water with a simple big data application: backup and archiving.
Architecture Best Practices for Big Data
5. Plan for a logical “data highway” with multiple caches of increasing latency, and physically implement only those caches appropriate for your environment. The data highway can have as many as five caches of increasing data latency, each with its own distinct analytic advantages and tradeoffs.
6. Use big data analytics as a “fact extractor” to move data to the next cache.
7. Use big data integration to build comprehensive ecosystems that integrate conventional structured RDBMS data, paper-based documents, emails, and in-house business-oriented social networking.
8. Plan for data quality to be better further along the data highway.
9. Apply filtering, cleansing, pruning, conforming, matching, joining, and diagnosing at the earliest touch points possible.
10. Implement backflows, especially from the EDW, to earlier caches on the data highway.
11. Implement streaming data analytics in selected data flows.
12. Set scalability limits far beyond current needs to avoid a “boundary crash.”
13. Perform big data prototyping on a public cloud and then move to a private cloud.
14. Expect and search for 10x to 100x performance improvements over time, recognizing the paradigm shift that analysis at very high speeds makes possible.
15. Separate big data analytic workloads from the conventional enterprise data warehouse to preserve EDW service level agreements.
16. Exploit unique capabilities of in-database analytics.
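To make practice 6 concrete, here is a minimal sketch of a “fact extractor” in Python. The log format, field names, and regular expression are all hypothetical, assumed for illustration only; the point is simply that a big data pass over raw, unstructured records can emit structured facts suitable for loading into the next, lower-latency cache on the data highway.

```python
import re

# Hypothetical raw web-server log lines sitting in the earliest,
# most unstructured cache on the data highway.
raw_lines = [
    '203.0.113.9 - GET /product/42 200 1532',
    '198.51.100.4 - GET /product/7 404 0',
]

# Assumed log layout: ip - method path status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) - (?P<method>\S+) (?P<path>\S+) (?P<status>\d+) (?P<bytes>\d+)'
)

def extract_facts(lines):
    """Pull structured facts out of raw text so they can be moved
    to the next cache for conventional analysis and integration."""
    facts = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:  # lines that fail to parse stay behind in the raw cache
            facts.append({
                'ip': m.group('ip'),
                'path': m.group('path'),
                'status': int(m.group('status')),
                'bytes_sent': int(m.group('bytes')),
            })
    return facts

facts = extract_facts(raw_lines)
```

Downstream caches then see only the extracted, typed fields, which is what lets practices 8 and 9 (cleansing and conforming at the earliest touch point) operate on manageable structures rather than raw text.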
Data Modeling Best Practices for Big Data
17. Think dimensionally: divide the world into dimensions and facts.
18. Anchor all dimensions with durable surrogate keys.
19. Expect to integrate structured and unstructured data.
20. Track time variance with slowly changing dimensions (SCDs).
21. Get used to not declaring data structures until analysis time.
22. Build technology around name-value pair data sources.
23. Use data virtualization to allow rapid prototyping and schema alterations.
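Practices 18 and 20 can be sketched together. In the toy Python example below, the table, column names, and dates are all invented for illustration: the natural key `customer_id` serves as the durable anchor that never changes, while each historical version of the customer gets its own surrogate `customer_key` row under slowly changing dimension (SCD) Type 2 rules.

```python
from datetime import date

# Tiny in-memory dimension table. Each row has a per-version surrogate
# key plus effective/expiration dates for SCD Type 2 history tracking;
# customer_id is the durable key shared by all versions of a customer.
customer_dim = [
    {'customer_key': 1, 'customer_id': 'C100', 'city': 'Chicago',
     'row_effective': date(2010, 1, 1),
     'row_expiration': date(9999, 12, 31),
     'current_flag': True},
]

def apply_scd2_change(dim, customer_id, new_city, change_date):
    """Expire the current row and insert a new version, so history
    is preserved rather than overwritten."""
    next_key = max(row['customer_key'] for row in dim) + 1
    for row in dim:
        if row['customer_id'] == customer_id and row['current_flag']:
            row['row_expiration'] = change_date
            row['current_flag'] = False
            new_row = dict(row,
                           customer_key=next_key,
                           city=new_city,
                           row_effective=change_date,
                           row_expiration=date(9999, 12, 31),
                           current_flag=True)
            dim.append(new_row)
            return

# The customer moves: old row is expired, a new version is inserted.
apply_scd2_change(customer_dim, 'C100', 'Denver', date(2015, 6, 1))
```

Facts recorded before the change continue to join to the old surrogate key, so time variance is tracked correctly no matter when the data is analyzed.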
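Practices 21 and 22 fit together as well: with name-value pair sources, the schema is discovered when you analyze the data, not declared when you load it. The sketch below uses invented sensor payloads and delimiters purely for illustration.

```python
# Hypothetical name-value pair payloads, e.g. from a sensor feed where
# each observation carries an arbitrary, undeclared set of attributes.
observations = [
    'device=thermostat-7;temp_c=21.5;battery=0.88',
    'device=door-2;open=true',
]

def parse_name_value(record, sep=';', assign='='):
    """Turn one delimited name-value string into a dict. No structure
    is declared up front; whatever names arrive become the columns."""
    pairs = (item.split(assign, 1) for item in record.split(sep) if item)
    return {name: value for name, value in pairs}

rows = [parse_name_value(r) for r in observations]
```

Note that the two rows legitimately have different attribute sets; deciding which names to conform into real dimension attributes is deferred to analysis time.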
Data Governance Best Practices for Big Data
24. There is no such thing as big data governance. Now that we have your attention, the point is that data governance must be a comprehensive approach for your entire data ecosystem, not a spot solution for big data in isolation.
25. Dimensionalize the data before applying governance. Here is an interesting challenge big data introduces: you must apply data governance principles even when you don’t know what to expect from the content of the data.
Big data brings a host of changes and opportunities to IT, and it is easy to think that a whole new set of rules must be created. But with the benefit of almost a decade of experience, many best practices have emerged. Many of these practices are recognizable extensions from the EDW/BI world, while quite a few are novel ways of thinking about data and the mission of IT. But the recognition that the mission has expanded is welcome and in some ways overdue. The current explosion of data-collecting channels, new data types, and new analytic opportunities means that the list of best practices will continue to grow in interesting ways.
This Design Tip is brutally brief. For an in-depth discussion of these best practices, please see my white paper, “Newly Emerging Best Practices for Big Data.” Since the paper was commissioned by Informatica, you’ll need to register on their site to download it. The content is vendor-neutral.