Wednesday, February 13, 2008

Data Warehousing Dilemma

We are in the midst of a long running Business Intelligence project. The original thought behind the iniative was a solution allowing analysts to explore their data. Through a combination of historical data and predictive modelling, the solution would provide key metrics to manage their business. After several mis-starts, we discovered the wisdom of Ralph Kimball.


Kimball's Data Warehouse Toolkit was an ephifany and inspiration. Suddenly before us was the solution to performance woes. His solution? A star schema, where with the warehouse measures are retained in a single table linked with related descriptive data. All the descriptive data (dimensions) were related through the single fact table containing the measures.

Our project started simple enough. Using sample report templates from our Product Management team, we determined a grain for the warehouse and computed the appropriate measures. The grain, by the way, is the lowest level of detail needed to answer the questions asked of the data.

It soon became apparent that there was a flaw in our design. Not the design of the warehouse, per se, but the design of the system. The reports designed by our Product Management team were at the level of the grain. Running them would produce thousands of pages of detail. Nowhere were we taking advantage of the warehouse's dimensions to drill into these reports. Seeing the flaw, we tasked our Product Managers with spec'ing the entry points to their reports.

What was returned to us was a disaster. Instead of taking advantage of the warehouse or even the capabilities of the Business Intelligence technology, the PMs designed more reports. These new reports were summaries with drill paths to the detail provided earlier. They also contained an entirely new set of metrics, all calculated at a different grain.

The problem of the different grain was exocerbated by many of the new metrics. These metrics were computed with division of aggregated counts. The counts, however, were not on the fact table, instead they were distinct counts of dimension values. There is the dilemma, we needed figures that could not be pre-computed into our cubes. The metrics were computed "on-the-fly" and resulted in tremendous performance problems.

I believe the solution is simple and obvious. We need additional fact tables a different grains. The purists among my team didn't see it so clearly (they will by time we're finished). The problem is Kimble's treatse on data warehouses discourages multiple fact tables in the database schema.

Kimble oversimplifyies warehouses with the star schema. Any complex set of data will have measures that can not be summarized into a single fact table. In truth, though, multiple fact tables will be an integral part of any practical solution based on a data warehouse.

No comments:

Post a Comment

You might also like ...