SQL Optimization

Hints for SQL Optimization #

Reviewing dbt and SQL pull requests almost every day, I am collecting some hands-on SQL recommendations, especially for BigQuery and DuckDB. There are some great resources on SQL optimization out there, like https://modern-sql.com/, and I am not trying to replicate those. This is a collection of hints that I have found to be frequently overlooked and that have an impact at scale.


Transformations in the Where Clause #

Move static operations, such as adding intervals or performing constant calculations, to the constant side of the where clause. The constant expression is then evaluated once instead of once per row, and the untouched column keeps the predicate sargable, so indexes and partition pruning remain usable.

This is not ideal:

with user_sessions as (
    select current_timestamp - interval '10 minute' as login_time union all
    select current_timestamp - interval '1 hour' union all
    select current_timestamp - interval '20 minute'
)

select 
    login_time
from 
    user_sessions
where 
    login_time + interval '1 hour' > current_timestamp;

This is usually better:

with user_sessions as (
    select current_timestamp - interval '10 minute' as login_time union all
    select current_timestamp - interval '1 hour' union all
    select current_timestamp - interval '20 minute'
)

select 
    login_time
from 
    user_sessions
where 
    login_time > current_timestamp - interval '1 hour'

Both queries return the same result, which looks something like this:

┌────────────────────────────┐
│         login_time         │
│  timestamp with time zone  │
├────────────────────────────┤
│ 2025-01-02 21:07:34.167-05 │
│ 2025-01-02 20:57:34.167-05 │
└────────────────────────────┘
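
The same principle applies when a function or cast wraps the column side of a predicate. As a sketch reusing the user_sessions CTE from above, comparing the bare column against a constant range keeps the predicate usable for zone maps, indexes, and partition pruning:

-- Not ideal: the cast is re-evaluated for every row and hides the column
select login_time
from user_sessions
where cast(login_time as date) = current_date;

-- Usually better: the bare column is compared against constant bounds
select login_time
from user_sessions
where login_time >= current_date
  and login_time < current_date + interval '1 day';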

Use Standard SQL Functions #

Leverage built-in standard SQL functions for common tasks like formatting dates, string manipulation, and case-based logic to write efficient and readable queries.

This is not ideal:

with transactions as (
    select '20240115' as event_date union all
    select '20230720' union all
    select '20230510' union all
    select '20240930' union all
    select '20241205'
)

select
    strftime('%Y', strptime(data.event_date, '%Y%m%d')) ||
    case
        when extract(month from strptime(data.event_date, '%Y%m%d')) in (1, 2, 3) then 'q1'
        when extract(month from strptime(data.event_date, '%Y%m%d')) in (4, 5, 6) then 'q2'
        when extract(month from strptime(data.event_date, '%Y%m%d')) in (7, 8, 9) then 'q3'
        when extract(month from strptime(data.event_date, '%Y%m%d')) in (10, 11, 12) then 'q4'
    end as year_quarter
from transactions as data;

This would likely be faster:

with transactions as (
    select '20240115' as event_date union all
    select '20230720' union all
    select '20230510' union all
    select '20240930' union all
    select '20241205'
)
select
    strftime('%Y', strptime(data.event_date, '%Y%m%d')) ||  
    'q' || extract(quarter from strptime(data.event_date, '%Y%m%d')) 
    as year_quarter
from transactions as data;

Some dialects also support:

select
    strftime('%Yq%Q', strptime(event_date, '%Y%m%d')) as year_quarter
from transactions;
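
In BigQuery, for example, a sketch of the equivalent (its %Q format element returns the quarter as 1-4) would be:

select
    format_date('%Yq%Q', parse_date('%Y%m%d', event_date)) as year_quarter
from transactions;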

The result is the same in each case:

┌──────────────┐
│ year_quarter │
│   varchar    │
├──────────────┤
│ 2024q1       │
│ 2023q3       │
│ 2023q2       │
│ 2024q3       │
│ 2024q4       │
└──────────────┘

Optimizing CASE WHEN with Mappings in SQL #

When dealing with mappings or transformations in SQL, a large CASE WHEN block directly within the query can be inefficient. A more optimized approach moves the mapping logic into a Common Table Expression (CTE) and performs a LEFT JOIN, with COALESCE providing a fallback for unmapped values.

Suppose we have a sales table with product codes that need to be mapped to specific categories. A simple CASE WHEN block can handle this, but as the number of mappings grows, the query becomes harder to maintain and less performant.

This approach works but can become inefficient over time:

with sales as (
    select 1 as sale_id, 'A1' as product_code, 100 as sale_amount union all
    select 2, 'B2', 150 union all
    select 3, 'C3', 200 union all
    select 4, 'D4', 250 union all
    select 5, 'A1', 120
)
select 
    s.sale_id,
    s.product_code,
    s.sale_amount,
    case s.product_code
        when 'A1' then 'Electronics'
        when 'B2' then 'Clothing'
        when 'C3' then 'Home'
        else 'Other'
    end as category
from sales s;

The CASE WHEN block directly maps product codes to categories. However, the SQL engine evaluates the branches sequentially for every row, so the cost grows with both the number of rows and the number of mappings. As the mapping list and the dataset grow, this can become a bottleneck.

By separating the mapping logic into a CTE and performing a LEFT JOIN, the engine can instead build a hash table over the mapping once and probe it for each row, keeping the per-row lookup cost roughly constant regardless of the number of mappings. This approach is also easier to maintain and scales better with larger datasets.

Using a LEFT JOIN with a mapping CTE (or, e.g., a dbt seed) is typically more efficient:

with sales as (
    select 1 as sale_id, 'A1' as product_code, 100 as sale_amount union all
    select 2, 'B2', 150 union all
    select 3, 'C3', 200 union all
    select 4, 'D4', 250 union all
    select 5, 'A1', 120
),
product_mapping as (
    select 'A1' as product_code, 'Electronics' as category union all
    select 'B2', 'Clothing' union all
    select 'C3', 'Home'
)

select 
    s.sale_id,
    s.product_code,
    s.sale_amount,
    coalesce(m.category, 'Other') as category
from sales s
left join product_mapping m
    on s.product_code = m.product_code;

This approach performs a join between the sales table and the mapping CTE. If a product code has no matching category, the COALESCE function assigns the default value 'Other'. This method is not only more efficient but also easier to update by modifying the CTE (or seed) directly.
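
With dbt, the mapping can also live in a seed instead of an inline CTE, keeping it version controlled and editable as plain CSV. A minimal sketch, assuming a hypothetical seed named product_mapping and an upstream sales model:

-- seeds/product_mapping.csv (hypothetical):
-- product_code,category
-- A1,Electronics
-- B2,Clothing
-- C3,Home

-- models/sales_categorized.sql (hypothetical):
select
    s.sale_id,
    s.product_code,
    s.sale_amount,
    coalesce(m.category, 'Other') as category
from {{ ref('sales') }} as s
left join {{ ref('product_mapping') }} as m
    on s.product_code = m.product_code

Once dbt seed has loaded the CSV as a table, the join behaves exactly like the CTE version above.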


Take Full Advantage of Partitions #

Especially in BigQuery, take advantage of partitioned columns in your table schema. Adding conditions on partition columns filters data early, which improves query efficiency and reduces costs.

This is not ideal:

select distinct not_a_partition_column 
from t
where not_a_partition_column > '2024-03-01'

If you know a relationship between not_a_partition_column and partition_column, use it. The following could be faster, or simply cheaper under BigQuery's on-demand pricing:

select distinct not_a_partition_column 
from t
where not_a_partition_column > '2024-03-01'
-- Assuming that `partition_column` is the partitioned column and that
-- not_a_partition_column is never more than two months later than partition_column
  and partition_column > '2024-01-01'
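
For reference, a minimal sketch of a matching table definition (dataset name and column types are assumptions); because the table is partitioned on partition_column, the added predicate lets BigQuery skip entire partitions before scanning:

create table my_dataset.t (
    not_a_partition_column date,
    partition_column date
)
partition by partition_column;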

Prefer LIKE over REGEXP #

Often LIKE predicates are much more efficient than regular expressions, even though a regular expression can be more compact to write. Consider the following setup, which is prepended to each of the queries below:

WITH test_data AS (
  SELECT 'Mechanical license' AS s
  UNION ALL SELECT 'includes MEC.' AS s
  UNION ALL SELECT 'We have MECh licenses here.' AS s
  UNION ALL SELECT 'The MECHaNICS' AS s
  UNION ALL SELECT 'Just mention Mec. once' AS s
  UNION ALL SELECT 'No references to anything' AS s
)
-- expand this data in a way BigQuery cannot easily optimize to simulate a larger dataset
, blown_up as (
SELECT 
    CAST(RAND()*10000 as STRING) 
        || s 
        || CAST(n as STRING) AS s 
FROM 
    test_data,
    UNNEST(GENERATE_ARRAY(1, 1000000)) AS n 
)

On many systems, something like this will be slower:

SELECT 
  s,
  NOT REGEXP_CONTAINS(s, r'(?i)(.*(MECHANICS).*)') 
        AND REGEXP_CONTAINS(s, r'(?i)(MEC\.|MECH)') AS reg_exp_matches
FROM blown_up;

And you might find the equivalent using LIKE to be faster:

SELECT 
  s,
  (UPPER(s) LIKE "%MECH%"
        OR UPPER(s) LIKE "%MEC.%")
        AND UPPER(s) NOT LIKE "%MECHANICS%" AS like_matches
FROM blown_up;
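
To verify this on your own data in BigQuery, you can compare the job statistics of both variants; the region qualifier and the job id below are placeholders:

select
    job_id,
    total_slot_ms,
    total_bytes_processed
from `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
where job_id = 'your_job_id';  -- placeholder: id of the query job to inspect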

Skip Intermediate GROUP BYs #

Skip intermediate GROUP BY clauses in CTEs when they are redundant; most optimizers cannot identify a redundant GROUP BY in combination with joins, so the extra aggregation actually runs. Consider this example:

-- setup
WITH test_data AS (
  SELECT 'apple' AS item, 10 AS quantity
  UNION ALL SELECT 'orange', 15
  UNION ALL SELECT 'apple', 5
  UNION ALL SELECT 'banana', 8
  UNION ALL SELECT 'cucumber', 7
  UNION ALL SELECT 'banana', 10
  UNION ALL SELECT 'cucumber', 5
),
dimension_data AS (
  SELECT 'apple' AS item, 'fruit' AS class
  UNION ALL SELECT 'orange', 'fruit'
  UNION ALL SELECT 'banana', 'fruit'
  UNION ALL SELECT 'cucumber', 'vegetable'
),

This is not ideal, e.g. when enriched is a view that is later reused in a BI layer:

enriched AS (
  SELECT
    class,
    SUM(quantity) AS total_quantity
  FROM test_data
  LEFT JOIN dimension_data
    USING (item)
  GROUP BY class
),
final_result AS (
  SELECT
    SUM(total_quantity) AS final_quantity
  FROM enriched
)
SELECT * FROM final_result;

This is usually more efficient:

enriched AS (
  SELECT
    class,
    quantity AS total_quantity
  FROM test_data
  LEFT JOIN dimension_data
    USING (item)
),
final_result AS (
  SELECT
    SUM(total_quantity) AS final_quantity
  FROM enriched
)
SELECT * FROM final_result;
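
Both variants return the same final_quantity, because SUM is distributive: 10 + 15 + 5 + 8 + 7 + 10 + 5 = 60, whether or not the quantities are first aggregated per class.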

Copyright (c) 2025 Nico Hein