One big fat dimension vs Multiple Small dimensions

Written by: Ankit G

Some days ago I came across a DW query posted on a technical discussion:

“Based on prior experience, can someone please comment on which is better: small dimensions versus a large dimension. I am asking in terms of performance and reporting?”

Now this is a common dilemma which DW designers face while designing Star Schemas. My thoughts:

If we want to split a dimension (along width) we can do that by snowflaking it (normalizing the dimensions) or by creating mini-dimensions. Snowflake schema will reduce data redundancy and duplication, but it will increase the number of joins required to fetch all the dimensional information.

The decision needs to be made based on two major factors: Performance and Usability

Dimension tables that contain customer or product information might easily have 50 to 100 attributes and many millions of rows. However, dimension tables with too many attributes can lead to excessively wide rows and poor performance. For this reason, you might want to separate out certain groups of attributes from a dimension table and put them in a separate table called a minidimension table.

A minidimension table consists of a small group of attributes that are separated out from a larger dimension table. You might choose to create a minidimension table for attributes that have either of the following characteristics:
  • The fields are rarely used as constraints in a query

  • The fields are frequently compared together

More information can be found on this link

Snowflaking also helps in reducing the width of dimension by splitting it. That works even better with hierarchical structure.

Now coming to UsabilitySimpler the better. So from usability perspective STAR schema (i.e. not splitting the dimension in question) is better. By keeping all the attributes together in one big fat dimension, we make is easy for the users to do find related attributes. They can pull many or all of them without exploding number of joins within query. Less number of joins would mean better query plan, easy to control the usage of indexes and simpler to debug. Even for designers/modellers, star schema works the best.

But we earlier talked that a big fat dimension like customer can be slow, didn’t we?? Well, the time is changing. We now have better database engines and more powerful servers. Even large volume dimensions like CUSTOMER can be easily handled by majority of databases these days. If we exploit all the performance optimization features, the volume of data in dimensions will hardly make any significant difference in performance. Here are still some points to be considered:

1) Indexing - Always have bitmap indexes on "where clause" candidates
2) Partitioning – Use appropriate partitioning and partitioning keys
3) If we have an SCD, we can reduce data by choosing which SCD type to use (trade-off between history /space /performance)
4) We can also use B-tree indexes that would be faster (then bitmap depending on the cardinality), but will occupy much more space then Bitmap
5) Histograms can be used
6) Materialized views to be used in case we want to split dimensions, but still want to stick to star schema approach. This does make sense as data in dimensions needs not to be refreshed very often
7) Convert the Dimension Table into a View and move each Hierarchy to a separate table, like partitioning. The RDBMS should be clever enough to only use the tables that are referenced in the queries
8) There are other measures/features to improve performance which are specific to databases like for oracle you can "Run Stats", Sybase IQ (Comes up with inbuilt indexes) and so on

So while having a huge dimension with hundreds of attributes can be daunting to users; splitting it into many and making a very complex model will also discourage them from using it for analysis. Acceptance of easy-to-use cubes/dimensions is key to successful BI deployments. The strategy should be a trade-off between the above two approaches keeping in mind various business /technical specific issues, some of which are specified above.

Comments