SQL Pitfalls

SQL Pitfalls #

Here I am collecting some pitfalls, that I either fell into myself or that I found interesting when I saw them in a pull request or elsewhere.


SUM with NULL Values #

When performing arithmetic operations with SUM in SQL, the handling of NULL values can lead to unexpected results. This is particularly noticeable when summing columns separately versus summing their combined values directly.

Consider this example:

WITH t AS (
  SELECT 2 AS x1, 2 AS x2 UNION ALL
  SELECT 6 AS x1, 4 AS x2 UNION ALL
  SELECT NULL AS x1, 8 AS x2
)

SELECT 
  SUM(x1) + SUM(x2) AS addition_after_sum,  -- SUMs separately, NULL is ignored
  SUM(x1 + x2) AS sum_after_addition        -- NULL in x1 causes entire row to be NULL
FROM t;

Which results in:

┌────────────────────┬────────────────────┐
│ addition_after_sum │ sum_after_addition │
│       int128       │       int128       │
├────────────────────┼────────────────────┤
│                 22 │                 14 │
└────────────────────┴────────────────────┘

Breakdown of the Issue:

  1. SUM(x1) and SUM(x2) ignore NULL values, resulting in:
    • SUM(x1) = 8 (2 + 6)
    • SUM(x2) = 14 (2 + 4 + 8)
    • SUM(x1) + SUM(x2) = 22
  2. However, SUM(x1 + x2) evaluates the addition row by row:
    • x1 + x2 for the row with NULL results in NULL.

This causes the third row to contribute nothing to the sum, leading to SUM(x1 + x2) = 14.

To avoid this pitfall, explicitly handle NULL values with COALESCE or IFNULL:

SELECT 
  SUM(x1) + SUM(x2) AS addition_after_sum,
  SUM(COALESCE(x1, 0) + COALESCE(x2, 0)) AS sum_after_addition
FROM t;

This results in consistent output:


┌────────────────────┬────────────────────┐
│ addition_after_sum │ sum_after_addition │
│       int128       │       int128       │
├────────────────────┼────────────────────┤
│                 22 │                 22 │
└────────────────────┴────────────────────┘

NULLs in the Where Clause #

Nulls are not to be confused with other values. Even if nulls behave as expected most of the time, things can become unintuitive when dealing with checks for inequality. Consider this example:

with t as (
  select 1 as v union all
  select 2 as v union all
  select null as v
)

select * from t where v <> 1; -- returns only where v <> 1 and v is not null

Which only returns:

┌───────┐
│   v   │
│ int32 │
├───────┤
│     2 │
└───────┘

It is not quite clear from this code if it is intentional that only a single row is returned. The intent might have been to returns both the rows with null and 2 in the v column.

with t as (
  select 1 as v union all
  select 2 as v union all
  select null as v
)

select * from t where (v <> 1 or v is null);

This returns the additional null row:

┌───────┐
│   v   │
│ int32 │
├───────┤
│     2 │
│       │
└───────┘

I do recommend making the intent explicit in this case using a small comment or handing the null case explicitly.


Filtering by DATETIME in SQL #

Filtering by DATETIME can lead to unintuitive results, especially when using inequality operators like <= or >= for exact date matches. A common pitfall arises when filtering for records up to a specific day or month but unintentionally excluding data from that day due to implicit time handling.

Consider this example:

WITH t AS (
  SELECT DATETIME '2024-01-01 10:00:00' AS dt UNION ALL
  SELECT DATETIME '2024-01-02 15:30:00' AS dt UNION ALL
  SELECT DATETIME '2024-01-03 08:45:00' AS dt UNION ALL
  SELECT DATETIME '2024-01-04 12:00:00' AS dt
)

SELECT *
FROM t
WHERE dt <= DATETIME '2024-01-03'; -- Excludes almost all 2024-01-03 records

This query only returns records from 2024-01-01 and 2024-01-02. The row with 2024-01-03 08:45:00 is unintentionally excluded because DATETIME '2024-01-03' is interpreted as 2024-01-03 00:00:00.

┌─────────────────────┐
│         dt          │
│      timestamp      │
├─────────────────────┤
│ 2024-01-01 10:00:00 │
│ 2024-01-02 15:30:00 │
└─────────────────────┘

To ensure all records from 2024-01-03 are included, a simple and clear solution is to add one day to the filter and use the < (smaller than) operator instead:

WITH t AS (
  SELECT DATETIME '2024-01-01 10:00:00' AS dt UNION ALL
  SELECT DATETIME '2024-01-02 15:30:00' AS dt UNION ALL
  SELECT DATETIME '2024-01-03 08:45:00' AS dt UNION ALL
  SELECT DATETIME '2024-01-04 12:00:00' AS dt
)
SELECT *
FROM t
WHERE dt < DATE_ADD(DATE '2024-01-03', INTERVAL 1 DAY);

This query effectively filters records before 2024-01-04 00:00:00, ensuring that all of 2024-01-03 is captured.

┌─────────────────────┐
│         dt          │
│      timestamp      │
├─────────────────────┤
│ 2024-01-01 10:00:00 │
│ 2024-01-02 15:30:00 │
│ 2024-01-03 08:45:00 │
└─────────────────────┘

Deduplication before Joins #

Quite frequently I do see a distinct clause after a join. This has become a code smell to me because it often indicates that the developer experienced an unintentional fanout, but does not exactly known where it came from. Fanout (also referred to as High Cardinality Join) occurs when one row multiplies into many during a join, often leading to inflated results and higher query execution costs.

To prevent unintentional fanout deduplicate the join key in CTEs before joining tables.

This is not ideal:

-- setup
with _left as (
    select 1 as id, 2 as value union all
    select 2 as id, 4 as value
),
_right as (
    select 1 as id, 'x' as class, 1 as category union all
    select 1 as id, 'x' as class, 2 as category
)
-- query
select distinct
    id,
    value,
    class
from _left
left join _right
using (id);

This can prevent unintentional fanout:

-- setup
with _left as (
    select 1 as id, 2 as value union all
    select 2 as id, 4 as value
),
_right as (
    select 1 as id, 'x' as class, 1 as category union all
    select 1 as id, 'x' as class, 2 as category
),
-- query
_right_dedup as (
    select distinct
        id,
        class
    from _right
)
select
    id,
    value,
    class
from _left
left join _right_dedup
using (id);

Both versions return the same result with the latter being more reliable:

┌───────┬───────┬─────────┐
│  id   │ value │  class  │
│ int32 │ int32 │ varchar │
├───────┼───────┼─────────┤
│     1 │     2 │ x       │
│     2 │     4 │         │
└───────┴───────┴─────────┘

Empty Strings in Joins #

Also related to unintentional fanout is the behavior of empty stings. Avoid joining on empty strings, as these can create unintended fanout.

This creates 6 output rows:

-- setup
with _left as (
    select 'a' as id, 'x1' as value union all
    select 'b' as id, 'y1' as value union all
    select '' as id, 'z1' as value union all
    select '' as id, 'z2' as value
),
_right as (
    select 'a' as id, 'x' as class union all
    select 'b' as id, 'y' as class union all
    select '' as id, 'z' as class union all
    select '' as id, 'z' as class
)
-- query
select
    _left.id as left_id,
    _right.id as right_id,
    value,
    class
from _left
left join _right
on _left.id = _right.id;

Here is the result table as markdown:

┌─────────┬──────────┬─────────┬─────────┐
│ left_id │ right_id │  value  │  class  │
│ varchar │ varchar  │ varchar │ varchar │
├─────────┼──────────┼─────────┼─────────┤
│ a       │ a        │ x1      │ x       │
│ b       │ b        │ y1      │ y       │
│         │          │ z1      │ z       │
│         │          │ z1      │ z       │
│         │          │ z2      │ z       │
│         │          │ z2      │ z       │
└─────────┴──────────┴─────────┴─────────┘

This can prevent unintentional fanout by not joining on empty strings:

-- setup
with _left as (
    select 'a' as id, 'x1' as value union all
    select 'b' as id, 'y1' as value union all
    select '' as id, 'z1' as value union all
    select '' as id, 'z2' as value
),
_right as (
    select 'a' as id, 'x' as class union all
    select 'b' as id, 'y' as class union all
    select '' as id, 'z' as class union all
    select '' as id, 'z' as class
)
--query
select
    _left.id as left_id,
    _right.id as right_id,
    value,
    class
from _left
left join _right
on nullif(_left.id, '') = nullif(_right.id, '');

Here is the resulting table:

┌─────────┬──────────┬─────────┬─────────┐
│ left_id │ right_id │  value  │  class  │
│ varchar │ varchar  │ varchar │ varchar │
├─────────┼──────────┼─────────┼─────────┤
│ a       │ a        │ x1      │ x       │
│ b       │ b        │ y1      │ y       │
│         │          │ z1      │         │
│         │          │ z2      │         │
└─────────┴──────────┴─────────┴─────────┘
Copyright (c) 2025 Nico Hein