BigQuery Storage Optimization #

Over time, data easily accumulates. Purging data that is no longer needed (Bad Data) can save cost and also reduce the carbon footprint of any Data Warehouse.

In this post, I describe a simple way to identify unused and therefore potentially obsolete data at the table level in BigQuery. The method is easy to reproduce and may help you reduce your BigQuery storage cost as well.

To identify unused data we need:

  1. an index of all tables that are in scope
  2. usage statistics for all this data

Table Index #

BigQuery provides great metadata within the INFORMATION_SCHEMA views. To create an index of all tables, INFORMATION_SCHEMA.TABLE_STORAGE_BY_ORGANIZATION is a good starting point.

The official documentation can be found here. Depending on where your data is stored, you might need to change the storage pricing applied below.

WITH
  table_storage_info AS (
  SELECT
    project_number,
    project_id,
    table_schema,
    table_name,
    creation_time,
    SUM(active_logical_bytes)
      / POWER(1024, 4) AS active_logical_terabytes,
    SUM(long_term_logical_bytes)
      / POWER(1024, 4) AS long_term_logical_terabytes,
    SUM(active_physical_bytes + fail_safe_physical_bytes)
      / POWER(1024, 4) AS active_physical_terabytes,
    SUM(long_term_physical_bytes)
      / POWER(1024, 4) AS long_term_physical_terabytes
  FROM
    [PROJECT_ID].`region-REGION`.INFORMATION_SCHEMA.TABLE_STORAGE_BY_ORGANIZATION
  WHERE
    NOT STARTS_WITH(table_schema, '_') -- excluding cache and temp tables
  GROUP BY
    ALL ),
  tables_with_cost AS (
  SELECT
    project_number,
    project_id,
    table_schema,
    table_name,
    creation_time,
    active_logical_terabytes,
    long_term_logical_terabytes,
    active_physical_terabytes,
    long_term_physical_terabytes,
    -- prices below are per GiB and month; adjust them to your region's storage pricing
    active_logical_terabytes * 1024 * 0.02 AS active_logical_dollars,
    long_term_logical_terabytes * 1024 * 0.01 AS long_term_logical_dollars,
    active_physical_terabytes * 1024 * 0.044 AS active_physical_dollars,
    long_term_physical_terabytes * 1024 * 0.022 AS long_term_physical_dollars
  FROM
    table_storage_info )
  
SELECT * FROM tables_with_cost

Given this index of stored tables, we can now proceed to identify unused and therefore potentially Bad Data.

Identify Unused Data #

We can now left join the list of existing tables with the usage statistics. This eliminates all statistics for tables that no longer exist.

We start by extracting the table details from all jobs. For this we use the INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION view and look at both referenced tables and destination tables, but exclude tables in cache and temporary tables.

Please note that INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION currently contains the job history of the last 180 days. If you want a shorter threshold, introduce a WHERE clause as sketched below. For a longer threshold, you would first need to start maintaining a table containing a longer history.
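
A minimal sketch of such a WHERE clause, assuming a 90-day window is enough (the interval is only an example; replace the placeholders as in the queries below):

-- Sketch: only consider jobs from the last 90 days instead of the full 180-day history.
-- [PROJECT_ID] and REGION are placeholders; the 90-day interval is an arbitrary example.
SELECT
  project_id,
  COUNT(1) AS job_count
FROM
  [PROJECT_ID].`region-REGION`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION
WHERE
  creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
GROUP BY
  project_id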

WITH destination_tables AS (
  SELECT
    destination_table.project_id AS project_id,
    destination_table.dataset_id AS dataset_id,
    destination_table.table_id AS table_id,
    creation_time,
    user_email
  FROM
    [PROJECT_ID].`region-REGION`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION AS bigquery_jobs ),
  referenced_tables AS (
  SELECT
    bigquery_jobs__referenced_tables.project_id AS project_id,
    bigquery_jobs__referenced_tables.dataset_id AS dataset_id,
    bigquery_jobs__referenced_tables.table_id AS table_id,
    creation_time,
    user_email
  FROM
    [PROJECT_ID].`region-REGION`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION AS bigquery_jobs
  LEFT JOIN
    UNNEST( bigquery_jobs.referenced_tables ) AS bigquery_jobs__referenced_tables ),
  table_jobs AS (
  SELECT
    project_id,
    dataset_id,
    table_id,
    creation_time,
    user_email
  FROM
    referenced_tables
  UNION ALL
  SELECT
    project_id,
    dataset_id,
    table_id,
    creation_time,
    user_email
  FROM
    destination_tables ),
used_tables AS (
SELECT
  project_id,
  dataset_id,
  table_id,
  MAX(creation_time) AS latest_job_creation_time,
  COUNT(DISTINCT user_email) AS distinct_user_count,
  COUNT(1) AS job_count
FROM
  table_jobs
WHERE
  NOT STARTS_WITH(dataset_id, '_')
GROUP BY
  ALL
)
SELECT * FROM used_tables

Now we can bring it all together and join the two results. Note that the query below assumes the tables_with_cost and used_tables CTEs from the queries above are available, for example combined into a single statement or stored as intermediate tables:

WITH tables_with_cost_and_usage AS (
  SELECT 
    tables_with_cost.*,
    used_tables.latest_job_creation_time,
    used_tables.distinct_user_count,
    used_tables.job_count
  FROM
    tables_with_cost
  LEFT JOIN
    used_tables
  ON
      used_tables.project_id = tables_with_cost.project_id
      AND used_tables.dataset_id = tables_with_cost.table_schema
      AND used_tables.table_id = tables_with_cost.table_name
)
SELECT
  project_number,
  project_id,
  table_schema,
  table_name,
  creation_time,
  latest_job_creation_time,
  distinct_user_count,
  job_count,
  SUM(active_logical_terabytes) AS active_logical_terabytes,
  SUM(long_term_logical_terabytes) AS long_term_logical_terabytes,
  SUM(active_physical_terabytes) AS active_physical_terabytes,
  SUM(long_term_physical_terabytes) AS long_term_physical_terabytes,
  SUM(active_logical_dollars) AS active_logical_dollars,
  SUM(long_term_logical_dollars) AS long_term_logical_dollars,
  SUM(active_physical_dollars) AS active_physical_dollars,
  SUM(long_term_physical_dollars) AS long_term_physical_dollars,
  SUM(active_logical_terabytes + long_term_logical_terabytes) AS total_logical_terabytes,
  SUM(active_physical_terabytes + long_term_physical_terabytes) AS total_physical_terabytes,
  SUM(active_logical_dollars + long_term_logical_dollars) AS total_logical_dollars,
  SUM(active_physical_dollars + long_term_physical_dollars) AS total_physical_dollars
FROM
  tables_with_cost_and_usage
GROUP BY
  ALL
ORDER BY
  total_physical_dollars DESC

I recommend storing the result of this final query for all further analysis.
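
One way to do this, as a minimal sketch: wrap the final query in a CREATE TABLE ... AS statement and add a snapshot date so that repeated runs can build up a history. The destination storage_analysis.table_storage_usage is a hypothetical name; adjust it to your setup.

-- Sketch only: materialize the final query's result so it can be analyzed later.
-- `storage_analysis.table_storage_usage` is a hypothetical destination; adjust to your setup.
-- For a history, append later runs with INSERT or a scheduled query instead of replacing.
CREATE OR REPLACE TABLE [PROJECT_ID].storage_analysis.table_storage_usage
PARTITION BY snapshot_date
AS
WITH tables_with_cost_and_usage AS (
  SELECT 1 AS placeholder  -- replace with the CTEs and the final SELECT from above
)
SELECT
  CURRENT_DATE() AS snapshot_date,  -- one snapshot per run
  tables_with_cost_and_usage.*
FROM
  tables_with_cost_and_usage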

Next steps #

Based on the above table you can easily:

  1. Identify data that has been unused for at least 6 months by filtering for job_count IS NULL and calculate the potential savings if this data was deleted (see the sketches after this list).
  2. Identify unpopular data where job_count < small_number and latest_job_creation_time is more than a few weeks in the past.
  3. Identify whether you can quickly save cost by changing the billing model of a dataset from Logical to Physical (see the official documentation and the sketches after this list). You should change a dataset to physical billing if total_physical_dollars < total_logical_dollars.
  4. Keep track of your actions, and keep a history of your tables, to identify the impact of the changes made.
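
As hedged sketches for steps 1 and 3, assuming the final result was stored in the hypothetical storage_analysis.table_storage_usage table from above:

-- Step 1 (sketch): tables without any job in the analyzed window and their potential savings.
SELECT
  project_id,
  table_schema,
  table_name,
  total_logical_dollars,
  total_physical_dollars
FROM
  [PROJECT_ID].storage_analysis.table_storage_usage
WHERE
  job_count IS NULL
ORDER BY
  total_physical_dollars DESC;

-- Step 3 (sketch): switch a dataset to physical storage billing where it is cheaper.
-- `some_dataset` is a placeholder; check the constraints in the official documentation first.
ALTER SCHEMA [PROJECT_ID].some_dataset
SET OPTIONS (storage_billing_model = 'PHYSICAL');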

Copyright (c) 2025 Nico Hein