SQL Pitfalls #
Here I am collecting some pitfalls, that I either fell into myself or that I found interesting when I saw them in a pull request or elsewhere.
SUM with NULL Values #
When performing arithmetic operations with SUM in SQL, the handling of NULL values can lead to unexpected results. This is particularly noticeable when summing columns separately versus summing their combined values directly.
Consider this example:
WITH t AS (
SELECT 2 AS x1, 2 AS x2 UNION ALL
SELECT 6 AS x1, 4 AS x2 UNION ALL
SELECT NULL AS x1, 8 AS x2
)
SELECT
SUM(x1) + SUM(x2) AS addition_after_sum, -- SUMs separately, NULL is ignored
SUM(x1 + x2) AS sum_after_addition -- NULL in x1 causes entire row to be NULL
FROM t;
Which results in:
┌────────────────────┬────────────────────┐
│ addition_after_sum │ sum_after_addition │
│ int128 │ int128 │
├────────────────────┼────────────────────┤
│ 22 │ 14 │
└────────────────────┴────────────────────┘
Breakdown of the Issue:
- SUM(x1) and SUM(x2) ignore NULL values, resulting in:
SUM(x1) = 8
(2 + 6)SUM(x2) = 14
(2 + 4 + 8)SUM(x1) + SUM(x2) = 22
- However,
SUM(x1 + x2)
evaluates the addition row by row:x1 + x2
for the row with NULL results in NULL.
This causes the third row to contribute nothing to the sum, leading to SUM(x1 + x2) = 14
.
To avoid this pitfall, explicitly handle NULL values with COALESCE or IFNULL:
SELECT SUM(x1) + SUM(x2) AS addition_after_sum, SUM(COALESCE(x1, 0) + COALESCE(x2, 0)) AS sum_after_addition FROM t;
This results in consistent output:
┌────────────────────┬────────────────────┐ │ addition_after_sum │ sum_after_addition │ │ int128 │ int128 │ ├────────────────────┼────────────────────┤ │ 22 │ 22 │ └────────────────────┴────────────────────┘
NULLs in the Where Clause #
Nulls are not to be confused with other values. Even if nulls behave as expected most of the time, things can become unintuitive when dealing with checks for inequality. Consider this example:
with t as (
select 1 as v union all
select 2 as v union all
select null as v
)
select * from t where v <> 1; -- returns only where v <> 1 and v is not null
Which only returns:
┌───────┐
│ v │
│ int32 │
├───────┤
│ 2 │
└───────┘
It is not quite clear from this code if it is intentional that only a single row is returned. The intent might have been to returns both the rows with null
and 2
in the v
column.
with t as (
select 1 as v union all
select 2 as v union all
select null as v
)
select * from t where (v <> 1 or v is null);
This returns the additional null row:
┌───────┐
│ v │
│ int32 │
├───────┤
│ 2 │
│ │
└───────┘
I do recommend making the intent explicit in this case using a small comment or handing the null case explicitly.
Filtering by DATETIME in SQL #
Filtering by DATETIME
can lead to unintuitive results, especially when using inequality operators like <=
or >=
for exact date matches. A common pitfall arises when filtering for records up to a specific day or month but unintentionally excluding data from that day due to implicit time handling.
Consider this example:
WITH t AS ( SELECT DATETIME '2024-01-01 10:00:00' AS dt UNION ALL SELECT DATETIME '2024-01-02 15:30:00' AS dt UNION ALL SELECT DATETIME '2024-01-03 08:45:00' AS dt UNION ALL SELECT DATETIME '2024-01-04 12:00:00' AS dt ) SELECT * FROM t WHERE dt <= DATETIME '2024-01-03'; -- Excludes almost all 2024-01-03 records
This query only returns records from
2024-01-01
and2024-01-02
. The row with2024-01-03 08:45:00
is unintentionally excluded becauseDATETIME '2024-01-03'
is interpreted as2024-01-03 00:00:00
.┌─────────────────────┐ │ dt │ │ timestamp │ ├─────────────────────┤ │ 2024-01-01 10:00:00 │ │ 2024-01-02 15:30:00 │ └─────────────────────┘
To ensure all records from
2024-01-03
are included, a simple and clear solution is to add one day to the filter and use the<
(smaller than) operator instead:WITH t AS ( SELECT DATETIME '2024-01-01 10:00:00' AS dt UNION ALL SELECT DATETIME '2024-01-02 15:30:00' AS dt UNION ALL SELECT DATETIME '2024-01-03 08:45:00' AS dt UNION ALL SELECT DATETIME '2024-01-04 12:00:00' AS dt ) SELECT * FROM t WHERE dt < DATE_ADD(DATE '2024-01-03', INTERVAL 1 DAY);
This query effectively filters records before
2024-01-04 00:00:00
, ensuring that all of2024-01-03
is captured.┌─────────────────────┐ │ dt │ │ timestamp │ ├─────────────────────┤ │ 2024-01-01 10:00:00 │ │ 2024-01-02 15:30:00 │ │ 2024-01-03 08:45:00 │ └─────────────────────┘
Deduplication before Joins #
Quite frequently I do see a distinct
clause after a join. This has become a code smell to me because it often indicates that the developer experienced an unintentional fanout, but does not exactly known where it came from.
Fanout (also referred to as High Cardinality Join) occurs when one row multiplies into many during a join, often leading to inflated results and higher query execution costs.
To prevent unintentional fanout deduplicate the join key in CTEs before joining tables.
This is not ideal:
-- setup with _left as ( select 1 as id, 2 as value union all select 2 as id, 4 as value ), _right as ( select 1 as id, 'x' as class, 1 as category union all select 1 as id, 'x' as class, 2 as category ) -- query select distinct id, value, class from _left left join _right using (id);
This can prevent unintentional fanout:
-- setup with _left as ( select 1 as id, 2 as value union all select 2 as id, 4 as value ), _right as ( select 1 as id, 'x' as class, 1 as category union all select 1 as id, 'x' as class, 2 as category ), -- query _right_dedup as ( select distinct id, class from _right ) select id, value, class from _left left join _right_dedup using (id);
Both versions return the same result with the latter being more reliable:
┌───────┬───────┬─────────┐
│ id │ value │ class │
│ int32 │ int32 │ varchar │
├───────┼───────┼─────────┤
│ 1 │ 2 │ x │
│ 2 │ 4 │ │
└───────┴───────┴─────────┘
Empty Strings in Joins #
Also related to unintentional fanout is the behavior of empty stings. Avoid joining on empty strings, as these can create unintended fanout.
This creates 6 output rows:
-- setup with _left as ( select 'a' as id, 'x1' as value union all select 'b' as id, 'y1' as value union all select '' as id, 'z1' as value union all select '' as id, 'z2' as value ), _right as ( select 'a' as id, 'x' as class union all select 'b' as id, 'y' as class union all select '' as id, 'z' as class union all select '' as id, 'z' as class ) -- query select _left.id as left_id, _right.id as right_id, value, class from _left left join _right on _left.id = _right.id;
Here is the result table as markdown:
┌─────────┬──────────┬─────────┬─────────┐
│ left_id │ right_id │ value │ class │
│ varchar │ varchar │ varchar │ varchar │
├─────────┼──────────┼─────────┼─────────┤
│ a │ a │ x1 │ x │
│ b │ b │ y1 │ y │
│ │ │ z1 │ z │
│ │ │ z1 │ z │
│ │ │ z2 │ z │
│ │ │ z2 │ z │
└─────────┴──────────┴─────────┴─────────┘
This can prevent unintentional fanout by not joining on empty strings:
-- setup with _left as ( select 'a' as id, 'x1' as value union all select 'b' as id, 'y1' as value union all select '' as id, 'z1' as value union all select '' as id, 'z2' as value ), _right as ( select 'a' as id, 'x' as class union all select 'b' as id, 'y' as class union all select '' as id, 'z' as class union all select '' as id, 'z' as class ) --query select _left.id as left_id, _right.id as right_id, value, class from _left left join _right on nullif(_left.id, '') = nullif(_right.id, '');
Here is the resulting table:
┌─────────┬──────────┬─────────┬─────────┐
│ left_id │ right_id │ value │ class │
│ varchar │ varchar │ varchar │ varchar │
├─────────┼──────────┼─────────┼─────────┤
│ a │ a │ x1 │ x │
│ b │ b │ y1 │ y │
│ │ │ z1 │ │
│ │ │ z2 │ │
└─────────┴──────────┴─────────┴─────────┘