Hints for SQL Optimization #
Reviewing dbt and SQL pull requests almost every day, I am collecting some hands-on SQL recommendations, especially for BigQuery and DuckDB. There are some great resources on SQL optimization out there, like https://modern-sql.com/, and I am not trying to replicate those. This is a collection of hints that I have found to be frequently overlooked and to have an impact at scale.
Transformations in the Where Clause #
Move static operations, such as adding intervals or performing constant calculations, to the static/constant side of the where clause. That keeps the column itself untransformed, so the engine can evaluate the constant expression once and still use partition pruning or indexes on the column.
This is not ideal:
with user_sessions as (
    select current_timestamp - interval '10 minute' as login_time
    union all
    select current_timestamp - interval '1 hour'
    union all
    select current_timestamp - interval '20 minute'
)
select login_time
from user_sessions
where login_time + interval '1 hour' > current_timestamp;
This is usually better:
with user_sessions as (
    select current_timestamp - interval '10 minute' as login_time
    union all
    select current_timestamp - interval '1 hour'
    union all
    select current_timestamp - interval '20 minute'
)
select login_time
from user_sessions
where login_time > current_timestamp - interval '1 hour';
Both queries return the same result, looking something like this:
┌────────────────────────────┐
│ login_time │
│ timestamp with time zone │
├────────────────────────────┤
│ 2025-01-02 21:07:34.167-05 │
│ 2025-01-02 20:57:34.167-05 │
└────────────────────────────┘
Use Standard SQL Functions #
Leverage built-in standard SQL functions for common tasks like formatting dates, string manipulation, and case-based logic to write efficient and readable queries.
This is not ideal:
with transactions as (
    select '20240115' as event_date
    union all
    select '20230720'
    union all
    select '20230510'
    union all
    select '20240930'
    union all
    select '20241205'
)
select
    strftime('%Y', strptime(data.event_date, '%Y%m%d'))
    || case
        when extract(month from strptime(data.event_date, '%Y%m%d')) in (1, 2, 3) then 'q1'
        when extract(month from strptime(data.event_date, '%Y%m%d')) in (4, 5, 6) then 'q2'
        when extract(month from strptime(data.event_date, '%Y%m%d')) in (7, 8, 9) then 'q3'
        when extract(month from strptime(data.event_date, '%Y%m%d')) in (10, 11, 12) then 'q4'
    end as year_quarter
from transactions as data;
This would likely be faster:
with transactions as (
    select '20240115' as event_date
    union all
    select '20230720'
    union all
    select '20230510'
    union all
    select '20240930'
    union all
    select '20241205'
)
select
    strftime('%Y', strptime(data.event_date, '%Y%m%d'))
    || 'q'
    || extract(quarter from strptime(data.event_date, '%Y%m%d')) as year_quarter
from transactions as data;
Some dialects also support:
select strftime('%Yq%Q', strptime(event_date, '%Y%m%d')) as year_quarter from transactions;
The result in both cases is the same:
┌──────────────┐
│ year_quarter │
│ varchar │
├──────────────┤
│ 2024q1 │
│ 2023q3 │
│ 2023q2 │
│ 2024q3 │
│ 2024q4 │
└──────────────┘
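The examples above use DuckDB's strftime and strptime. On BigQuery, a roughly equivalent version, sketched here with parse_date, format_date, and extract and assuming the same transactions CTE as above, would look like this:
-- sketch of a BigQuery equivalent: parse the yyyymmdd string,
-- then build '<year>q<quarter>' from it
select
    format_date('%Y', parse_date('%Y%m%d', event_date))
    || 'q'
    || cast(extract(quarter from parse_date('%Y%m%d', event_date)) as string) as year_quarter
from transactions;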
Optimizing CASE WHEN with Mappings in SQL #
When dealing with mappings or transformations in SQL, using a large CASE WHEN block directly within the query can be inefficient. A more optimized approach is to move the mapping logic into a Common Table Expression (CTE) and perform a LEFT JOIN, with COALESCE providing a fallback for default values.
Suppose we have a sales table with product codes that need to be mapped to specific categories. A simple CASE WHEN block can handle this, but as the number of mappings grows, the query becomes harder to maintain and less performant.
This approach works but can become inefficient over time:
with sales as (
    select 1 as sale_id, 'A1' as product_code, 100 as sale_amount
    union all
    select 2, 'B2', 150
    union all
    select 3, 'C3', 200
    union all
    select 4, 'D4', 250
    union all
    select 5, 'A1', 120
)
select
    s.sale_id,
    s.product_code,
    s.sale_amount,
    case s.product_code
        when 'A1' then 'Electronics'
        when 'B2' then 'Clothing'
        when 'C3' then 'Home'
        else 'Other'
    end as category
from sales s;
The CASE WHEN block directly maps product codes to categories. However, the engine has to evaluate the branches sequentially for every row, so the per-row cost grows with the number of mappings, and the query text itself grows with every new mapping. As the list of mappings expands, this becomes both a maintenance and a performance burden.
By separating the mapping logic into a CTE and performing a LEFT JOIN, the engine can treat the mapping as a regular (small) relation, typically building a hash table over it and looking up each product code in roughly constant time. This approach is easier to maintain and scales better with larger datasets.
Using a LEFT JOIN with a mapping CTE (or e.g. a dbt seed) is typically more efficient:
with sales as (
    select 1 as sale_id, 'A1' as product_code, 100 as sale_amount
    union all
    select 2, 'B2', 150
    union all
    select 3, 'C3', 200
    union all
    select 4, 'D4', 250
    union all
    select 5, 'A1', 120
),
product_mapping as (
    select 'A1' as product_code, 'Electronics' as category
    union all
    select 'B2', 'Clothing'
    union all
    select 'C3', 'Home'
)
select
    s.sale_id,
    s.product_code,
    s.sale_amount,
    coalesce(m.category, 'Other') as category
from sales s
left join product_mapping m
    on s.product_code = m.product_code;
This approach performs a join between the sales table and the mapping CTE. If a product code does not have a matching category, the COALESCE function assigns the default value of Other. This method is not only more efficient but also easier to update by modifying the CTE (or seed) directly.
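In a dbt project, the mapping does not even have to live in the SQL: it can be a seed. A minimal sketch, assuming a hypothetical seed file seeds/product_mapping.csv and that sales is available as a model or source:
product_code,category
A1,Electronics
B2,Clothing
C3,Home
The model then references it via ref() and keeps the same COALESCE fallback:
-- sketch of a dbt model joining against the hypothetical product_mapping seed
select
    s.sale_id,
    s.product_code,
    s.sale_amount,
    coalesce(m.category, 'Other') as category
from {{ ref('sales') }} as s
left join {{ ref('product_mapping') }} as m
    on s.product_code = m.product_code
This keeps the mapping version-controlled and lets you change it without touching the model's SQL.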
Take Full Advantage of Partitions #
Especially on BigQuery, take advantage of partitioned columns in your table schema. Adding conditions on the partition column filters data early, improving query efficiency and reducing costs.
This is not ideal:
select distinct not_a_partition_column
from t
where not_a_partition_column > '2024-03-01'
If you know a relationship between the not_a_partition_column and the partition_column, then use it.
This could be faster or simply cheaper when using BigQuery On-Demand pricing:
select distinct not_a_partition_column
from t
where not_a_partition_column > '2024-03-01'
-- Assuming that `partition_column` is the partitioned column and
-- not_a_partition_column doesn't have values prior to 2 months before partition_column
and partition_column > '2024-01-01'
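For the extra filter to prune anything, the table has to be partitioned in the first place. A minimal sketch of the corresponding BigQuery DDL, with placeholder dataset, table, and column names:
create table my_dataset.t (
    not_a_partition_column date,
    partition_column date
)
-- BigQuery can skip entire partitions when the where clause constrains partition_column
partition by partition_column;
In dbt models targeting BigQuery, the same is usually expressed through the partition_by model config.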
Prefer LIKE over REGEXP #
LIKE queries are often much more efficient than regular expressions, even though regular expressions can be more compact to write.
WITH test_data AS (
SELECT 'Mechanical license' AS s
UNION ALL SELECT 'includes MEC.' AS s
UNION ALL SELECT 'We have MECh licenses here.' AS s
UNION ALL SELECT 'The MECHaNICS' AS s
UNION ALL SELECT 'Just mention Mec. once' AS s
UNION ALL SELECT 'No references to anything' AS s
)
-- expand this data in a way BigQuery cannot easily optimize to simulate a larger dataset
, blown_up as (
SELECT
CAST(RAND()*10000 as STRING)
|| s
|| CAST(n as STRING) AS s
FROM
test_data,
UNNEST(GENERATE_ARRAY(1, 1000000)) AS n
)
On many systems something like this will be slower:
SELECT
    s,
    NOT REGEXP_CONTAINS(s, r'(?i)(.*(MECHANICS).*)')
        AND REGEXP_CONTAINS(s, r'(?i)(MEC\.|MECH)') AS reg_exp_matches
FROM blown_up;
And you might find the equivalent query using LIKE to be faster:
SELECT
    s,
    (UPPER(s) LIKE "%MECH%" OR UPPER(s) LIKE "%MEC.%")
        AND UPPER(s) NOT LIKE "%MECHANICS%" AS like_matches
FROM blown_up;
Skip Intermediate GROUP BYs #
Redundant intermediate GROUP BY clauses in CTEs should be skipped, because most optimizers cannot identify redundant GROUP BY clauses in combination with joins. Consider this example:
-- setup
WITH test_data AS (
SELECT 'apple' AS item, 10 AS quantity
UNION ALL SELECT 'orange', 15
UNION ALL SELECT 'apple', 5
UNION ALL SELECT 'banana', 8
UNION ALL SELECT 'cucumber', 7
UNION ALL SELECT 'banana', 10
UNION ALL SELECT 'cucumber', 5
),
dimension_data AS (
SELECT 'apple' AS item, 'fruit' AS class
UNION ALL SELECT 'orange', 'fruit'
UNION ALL SELECT 'banana', 'fruit'
UNION ALL SELECT 'cucumber', 'vegetable'
),
This is not ideal, e.g. if enriched is a view later used in a BI layer:
enriched AS (
    SELECT
        class,
        SUM(quantity) AS total_quantity
    FROM test_data
    LEFT JOIN dimension_data USING (item)
    GROUP BY class
),
final_result AS (
    SELECT SUM(total_quantity) AS final_quantity
    FROM enriched
)
SELECT * FROM final_result;
This is usually more efficient:
enriched AS (
    SELECT
        class,
        quantity AS total_quantity
    FROM test_data
    LEFT JOIN dimension_data USING (item)
),
final_result AS (
    SELECT SUM(total_quantity) AS final_quantity
    FROM enriched
)
SELECT * FROM final_result;