Hints for SQL Optimization
#
Reviewing dbt and SQL pull requests almost every day, I am collecting some hands on SQL recommendations especially for BigQuery and DuckDB.
There are some great resources on SQL Optimization out there like https://modern-sql.com/ and I am not trying to replicate those. This is a collection of hints that I have find to be frequently overlooked and have impact at scale.
Move static operations, such as adding intervals or performing constant calculations, to the static/constant side of the where
clause.
...SQL Pitfalls
#
Here I am collecting some pitfalls, that I either fell into myself or that I found interesting when I saw them in a pull request or elsewhere.
SUM with NULL Values
#
When performing arithmetic operations with SUM in SQL, the handling of NULL values can lead to unexpected results. This is particularly noticeable when summing columns separately versus summing their combined values directly.
Consider this example:
...
Oct 1, 2024Understanding Support Patterns
#
Finding patterns in the support cases helps to find structural issues and bottlenecks in the data platform.
Just after BigQuery released the ML.GENERATE_TEXT
functions (mid 2024) and made Gemini accessible though BigQuery, I was curios to to test this new feature. And what better use case then understanding Support Patterns and issues with the support on the fly.
Thus, I tried these features by analyzing a sample of our internal support tickets.
...
Aug 5, 2024BigQuery Storage Optimization
#
Over time data easily accumulates. Purging no longer needed data (Bad Data) can save cost and can also reduce the carbon footprint of any Data Warehouse.
In this post, I am describing a simple to identify unused and therefore potentially obsolete data on a table level in BigQuery. This method is easy to reproduce and may help you to also reduce your BigQuery storage cost.
...
Jul 13, 2024GCS Storage Optimization
#
With data volumes continuously growing, optimizing Google Cloud Storage usage can lead to significant cost savings. To tackle this challenge, I developed a Python utility that helps summarize and analyze the stored data, making it easier to identify large files and folders on GCS.
While identifying the total storage cost on a bucket is relatively straight forward using the GCP billing report, identifying large files and folders on buckets can be a tedious task. This utility helps to quickly identify large blobs / files and folders.
...
Jun 21, 2023I had the honor to contribute to the Astonomer Blog on medium with an article on
Running a Multi-Tenant Airflow Cluster
The post explores how we leverage Apache Airflow in a multi-tenant setup to orchestrate diverse workloads like royalty processing, financial analytics, and marketing insights. By utilizing Google Cloud Composer, we balances cost, stability, and operational overhead across multiple teams, ensuring seamless data workflows. Key practices include workload isolation, use of short-lived credentials, atomic tasks, and cost allocation through Kubernetes namespaces. The post also covers my thoughts on continuous deployment, DAG testing, and challenges around for maintaining and upgrading Airflow environments.
...