Data Modeling
#
There are great resources out there covering the basic Data Modeling considerations. I do recommend the Book Fundamentals of Data Engineering: Plan and Build Robust Data Systems 1st Edition by Joe Reis, and Matt Housley for getting started.
However, data modeling is often approached with a rigid, almost dogmatic mindset—you’re either an Inmon or Kimball person, or you swear by Data Vault.
In reality, I believe a more pragmatic approach is necessary. No single method is universally superior. More often than not, I find myself making compromises, and the results can be quite elegant when you can introduce a normalized layer in your warehouse. That layer doesn’t need to be permanent—it can be ephemeral, serving its purpose as needed.
...Hints for SQL Optimization
#
Reviewing dbt and SQL pull requests almost every day, I am collecting some hands on SQL recommendations especially for BigQuery and DuckDB.
There are some great resources on SQL Optimization out there like https://modern-sql.com/ and I am not trying to replicate those. This is a collection of hints that I have find to be frequently overlooked and have impact at scale.
Move static operations, such as adding intervals or performing constant calculations, to the static/constant side of the where
clause.
...SQL Pitfalls
#
Here I am collecting some pitfalls, that I either fell into myself or that I found interesting when I saw them in a pull request or elsewhere.
SUM with NULL Values
#
When performing arithmetic operations with SUM in SQL, the handling of NULL values can lead to unexpected results. This is particularly noticeable when summing columns separately versus summing their combined values directly.
Consider this example:
...
Oct 1, 2024Understanding Support Patterns
#
Finding patterns in the support cases helps to find structural issues and bottlenecks in the data platform.
Just after BigQuery released the ML.GENERATE_TEXT
functions (mid 2024) and made Gemini accessible though BigQuery, I was curios to to test this new feature. And what better use case then understanding Support Patterns and issues with the support on the fly.
Thus, I tried these features by analyzing a sample of our internal support tickets.
...
Aug 5, 2024BigQuery Storage Optimization
#
Over time data easily accumulates. Purging no longer needed data (Bad Data) can save cost and can also reduce the carbon footprint of any Data Warehouse.
In this post, I am describing a simple to identify unused and therefore potentially obsolete data on a table level in BigQuery. This method is easy to reproduce and may help you to also reduce your BigQuery storage cost.
...