Designing To The Query Requirements
To address the challenge of designing the data warehouse to be query efficient even for ad hoc access, techniques such as star schemas, denormalization, query parallelism and data partitioning are used to make the data warehouse as query efficient as possible. The aim is to design and layout the fact data to make it as accessible and query efficient as it can be. The partitioning, while used primarily for ease of management, should also be designed for performance, with data partitioned in business meaningful chunks. Data is denormalized to avoid costly joins between huge tables. The fact data and its dimension data are organized into some variant of a star schema to allow star query access between the dimensions and the fact table. Finally, queries are parallelized to allow them to process such vast quantities of data in a realistic time-scale.
The data layout is then designed to take all this into account. The fact and dimension data are placed on disk to minimize the possibilities of any I/O bottlenecks and to maximize parallel access to the fact data. The irony is that, having gone to such great lengths to make the fact data maximally accessible, the next step is to ensure that the users never access it. You should aim to have 90% or more of all queries answered by access to aggregations. A query that runs against an aggregation will avoid the sort and aggregation work that it would have to do if it was run against the fact data. As sort is one of the most costly operations, and one that does not scale linearly as the data grows, the savings made by avoiding sorts are significant.
