Data Modeling #
There are great resources out there covering the basic data modeling considerations. For getting started, I recommend the book Fundamentals of Data Engineering: Plan and Build Robust Data Systems (1st Edition) by Joe Reis and Matt Housley.
However, data modeling is often approached with a rigid, almost dogmatic mindset—you’re either an Inmon or Kimball person, or you swear by Data Vault.
In reality, I believe a more pragmatic approach is necessary. No single method is universally superior. More often than not, I find myself making compromises, and the results can be quite elegant, for example when introducing a normalized layer in your warehouse. That layer doesn’t need to be permanent; it can be ephemeral, serving its purpose as needed.
In the section below, I cover the considerations I recommend weighing before making data modeling decisions.
OBT vs Star Schema with Live Joins #
This little guide addresses a key decision in analytics engineering: choosing between live joins in BI tools (e.g. Looker) and using large preprocessed tables. This decision is nuanced and depends on several critical factors.
When should one opt for live joins versus one big, preprocessed table?
This is a hard question with a lot of nuance in its answer. In this guide I list the key factors I usually consider when deciding between pre-calculated tables and live joins; sometimes a combination of both approaches is appropriate. The factors include, but are not limited to:
- Data volume
- Frequency of updates
- Frequency of usage
- Dashboard load times
- The incrementality of the central fact table data
- Historical data depth
- Requirements of joins
- Data warehouse billing models
- Development time
- Maintainability
- Data Governance and Security
In the sections below I elaborate on how each of these dimensions influences the decision:
Data Volume #
Smaller data is cheaper and faster to re-process. The larger the data volume, the more carefully this evaluation has to be carried out, because the final design can have a tremendous impact on overall solution cost and on the performance of both preprocessing and analytics.
Consider not just the current size, but projected growth. A scalable solution is essential for rapidly growing datasets.
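As a quick illustration, here is a back-of-the-envelope sketch (in Python, with purely assumed numbers) of how projected growth translates into the bytes a daily full rebuild would scan:

```python
# Back-of-the-envelope sketch: project table growth and the bytes a daily
# full reprocess would scan each month. All numbers are illustrative assumptions.

current_gb = 500             # assumed current fact table size
monthly_growth_rate = 0.08   # assumed 8% growth per month
horizon_months = 24

size_gb = current_gb
for month in range(1, horizon_months + 1):
    size_gb *= 1 + monthly_growth_rate
    if month % 6 == 0:
        # A daily full rebuild of a one-big-table model scans the whole table
        # every day; an incremental model only touches recent partitions.
        full_rebuild_tb_per_month = size_gb * 30 / 1024
        print(f"Month {month:2d}: ~{size_gb:,.0f} GB, "
              f"daily full rebuilds would scan ~{full_rebuild_tb_per_month:,.1f} TB/month")
```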
Frequency of Updates #
The higher the update frequency, the more important it becomes to consider the processing overhead of reprocessing unchanged data. Techniques like data partitioning and indexing help optimize incremental table updates. The lower the update frequency, the less we need to be concerned about reprocessing the entire dataset.
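As a minimal sketch of what incremental processing can look like in practice, the snippet below assumes BigQuery and hypothetical table and column names; it merges only the last few date partitions instead of rewriting the whole table:

```python
# Minimal sketch of an incremental update into a date-partitioned table,
# assuming BigQuery and hypothetical table/column names (project.dataset.*).
from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
MERGE `project.dataset.fact_orders` AS target
USING (
  SELECT * FROM `project.dataset.stg_orders`
  WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
) AS source
ON target.order_id = source.order_id
   AND target.order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)  -- prune target partitions
WHEN MATCHED THEN
  UPDATE SET amount = source.amount
WHEN NOT MATCHED THEN
  INSERT ROW
"""

client.query(merge_sql).result()  # blocks until the merge finishes
```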
Frequency of Use #
The higher the frequency of use, the more optimized and curated the underlying data should be. The larger the ratio between frequency of use and frequency of updates, the more preprocessing should be moved upstream, so that basic calculations are not repeated unnecessarily.
Also consider concurrency of use: a single dashboard load can trigger multiple queries against the data warehouse.
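A rough break-even calculation can make this concrete. The sketch below uses purely assumed figures for refresh rates, dashboard traffic, and bytes scanned per query:

```python
# Rough break-even sketch: is it cheaper to preprocess once per update or to
# repeat the join on every dashboard query? All figures are assumptions.

updates_per_day = 24            # hourly refresh of the fact table
dashboard_loads_per_day = 400
queries_per_load = 6            # one dashboard load fires several queries

gb_scanned_live_join = 12.0     # assumed bytes scanned by the joined query
gb_scanned_preprocess = 60.0    # assumed bytes scanned to rebuild the big table
gb_scanned_obt_query = 3.0      # assumed bytes scanned per query on the big table

live_gb_per_day = dashboard_loads_per_day * queries_per_load * gb_scanned_live_join
obt_gb_per_day = (updates_per_day * gb_scanned_preprocess
                  + dashboard_loads_per_day * queries_per_load * gb_scanned_obt_query)

print(f"live joins:  ~{live_gb_per_day:,.0f} GB scanned/day")
print(f"OBT + reuse: ~{obt_gb_per_day:,.0f} GB scanned/day")
```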
Dashboard Load Times #
The higher the expectations for fast dashboard loads, the better the caching and data preprocessing layers need to be aligned with the analytics needs. It is important to balance data freshness with performance.
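One simple way to balance freshness and speed is to serve a preprocessed rollup while it is fresh enough and fall back to a live query otherwise. The sketch below is illustrative; the function name and the 15-minute SLA are assumptions:

```python
# Sketch of the freshness-vs-speed trade-off: serve a fast, preprocessed rollup
# while it is fresh enough, otherwise fall back to a slower live query.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(minutes=15)  # assumed acceptable staleness

def choose_source(rollup_last_updated: datetime) -> str:
    age = datetime.now(timezone.utc) - rollup_last_updated
    return "preprocessed_rollup" if age <= FRESHNESS_SLA else "live_join"

print(choose_source(datetime.now(timezone.utc) - timedelta(minutes=5)))   # preprocessed_rollup
print(choose_source(datetime.now(timezone.utc) - timedelta(minutes=45)))  # live_join
```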
Incrementality of Central Fact Table #
If the central fact table is incremental in nature and historical data is not expected to change, the join requirements and usage patterns outlined below determine how important live joins are in the analytics layer.
Also consider that full refreshes might be necessary despite incrementality, for example due to schema changes or data corrections.
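A minimal sketch of such a load-mode decision might look like this; the flags and mode names are hypothetical:

```python
# Sketch of a load-mode decision for an incremental fact table: append new
# partitions by default, but fall back to a full refresh when the schema
# changed or a data correction requires a backfill.

def choose_load_mode(schema_changed: bool, backfill_requested: bool) -> str:
    if schema_changed or backfill_requested:
        return "full_refresh"       # rebuild the whole table despite incrementality
    return "incremental_append"     # only process the newest partitions

print(choose_load_mode(schema_changed=False, backfill_requested=False))  # incremental_append
print(choose_load_mode(schema_changed=True, backfill_requested=False))   # full_refresh
```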
Historical Data Depth #
For incremental and time-partitioned tables, the historical data depth should be considered, as less historical depth inherently means smaller data volume.
Sometimes it is worth considering two distinct solutions: one for older, less frequently used data and one for fresh, frequently used data.
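The sketch below illustrates one possible hot/cold split, routing queries by date to either a curated recent table or a leaner archive; table names and the 90-day boundary are assumptions:

```python
# Sketch of a hot/cold split: recent partitions live in a curated, preprocessed
# table, while deep history stays in a leaner archive that is joined live on demand.
from datetime import date, timedelta

HOT_WINDOW = timedelta(days=90)  # assumed boundary between "hot" and "cold" data

def table_for(query_start_date: date) -> str:
    if date.today() - query_start_date <= HOT_WINDOW:
        return "analytics.fact_orders_hot"       # preprocessed, dashboard-facing
    return "analytics.fact_orders_archive"       # normalized, joined live when needed

print(table_for(date.today() - timedelta(days=30)))   # analytics.fact_orders_hot
print(table_for(date.today() - timedelta(days=400)))  # analytics.fact_orders_archive
```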
Requirements of Joins #
If the data has to be joined to frequently changing dimension tables, it is important to understand the impact of these joins on historical data. If new dimension information has to be applied to historical data, either full reprocessing or live joins might be valid options.
Incremental pre-processing can only be applied if there is a tolerance for historical data being enriched with stale dimension data. A higher frequency of change in dimension data that must be applied to historical data indicates that a live join might become the solution of choice.
The complexity of the joins and the size of the dimension tables also play a role in the decision. Join complexity can dramatically impact query performance, so query optimization techniques such as pre-joining multiple related dimension tables might have to be considered. If otherwise complex joins can be simplified by precomputing an efficient join key, this can make a live join considerably cheaper.
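A common way to precompute an efficient join key is to hash the composite business key into a single column on both sides of the join. The sketch below is a minimal illustration; the column values and key length are assumptions:

```python
# Sketch of precomputing a surrogate join key: hashing a composite business key
# into a single column so the later live join is a cheap single-column equi-join.
import hashlib

def surrogate_key(*parts: str) -> str:
    """Deterministic key from a composite business key (e.g. customer_id, country, channel)."""
    raw = "|".join(parts)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

# Stored on both the fact rows and the dimension rows at load time:
fact_key = surrogate_key("cust-42", "DE", "web")
dim_key = surrogate_key("cust-42", "DE", "web")
assert fact_key == dim_key  # the live join matches on one precomputed column
```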
Data Warehouse Billing Models #
There are two main billing models for data warehouses:
- billing by amount of data read for each query (e.g. BigQuery On Demand billing)
- billing by amount of compute resources used for each query (e.g. Snowflake & BigQuery Editions)
Because big denormalized tables carry a lot of redundancy, the amount of data read by a live join can be much lower than the amount of data read by an equivalent query against one big table. However, every join increases the compute required to answer a query, and reading from a single table is often computationally cheaper.
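The sketch below puts rough numbers on that trade-off under bytes-scanned billing; the query volume, bytes per query, and price per TB are all assumed values:

```python
# Sketch comparing the same daily workload under bytes-scanned billing.
# All sizes, prices, and query counts are illustrative assumptions.

queries_per_day = 2_000
price_per_tb_scanned = 6.25     # assumed on-demand price per TB

tb_scanned_star = 0.004         # narrow fact table + small dimensions per query
tb_scanned_obt = 0.015          # wide denormalized table per query

on_demand_star = queries_per_day * tb_scanned_star * price_per_tb_scanned
on_demand_obt = queries_per_day * tb_scanned_obt * price_per_tb_scanned

print(f"on-demand, star schema with live joins: ~${on_demand_star:,.0f}/day")
print(f"on-demand, one big table:               ~${on_demand_obt:,.0f}/day")
# Under compute-based billing the picture can flip: the joins add CPU time,
# while the single-table scan is computationally simpler per query.
```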
Development Time #
The more optimized a data model becomes, with incremental processing and additional data curation steps, the more development time is required. Also consider that faster development might lead to higher maintenance costs later. Try to strike a balance between upfront development time, long-term cost, and maintenance.
Maintainability #
The more denormalized a schema becomes, the higher the risk of data consistency issues. The more complex the overall solution becomes, the more important it is that it is quality controlled and well documented. The complexity of the data quality control framework should be considered as part of the initial design.
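A minimal example of such a quality control is a daily reconciliation between the denormalized table and its normalized source. The sketch below assumes a hypothetical run_query helper and illustrative table names:

```python
# Sketch of a simple consistency check for a denormalized model: reconcile a
# daily revenue total in the one-big-table against the normalized source.
# `run_query` is a hypothetical helper returning a single numeric value.

def reconcile(run_query, tolerance: float = 0.001) -> bool:
    obt_total = run_query("SELECT SUM(revenue) FROM analytics.obt_orders WHERE order_date = CURRENT_DATE()")
    src_total = run_query("SELECT SUM(revenue) FROM raw.fact_orders WHERE order_date = CURRENT_DATE()")
    drift = abs(obt_total - src_total) / max(abs(src_total), 1e-9)
    return drift <= tolerance  # flag the model when totals diverge beyond tolerance

# Example with a stubbed query runner that returns the same total for both queries:
print(reconcile(lambda sql: 1_000_000.0))  # True
```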
Data Governance and Security #
Data access controls and compliance requirements can influence the architecture choice. Live joins may offer more granular control in some scenarios.
Conclusion #
The choice between live joins and a preprocessed big table involves balancing multiple factors, including performance, cost, data structure, maintainability, and scalability. This guide provides a framework for making informed decisions, but each scenario should be evaluated on its unique requirements and constraints.