BigQuery Storage Optimization #
Over time, data accumulates easily. Purging data that is no longer needed (Bad Data) can save cost and also reduce the carbon footprint of any data warehouse.
In this post, I describe a simple way to identify unused and therefore potentially obsolete data at the table level in BigQuery. The method is easy to reproduce and may help you reduce your BigQuery storage cost as well.
To identify unused data, we need:
- an index of all tables that are in scope
- usage statistics for all of these tables
Table Index #
BigQuery provides great metadata within the INFORMATION_SCHEMA tables. To create an index of all tables, INFORMATION_SCHEMA.TABLE_STORAGE_BY_ORGANIZATION is a good starting point.
The official documentation can be found here. Depending on where your data is stored, you might need to change the storage pricing applied below.
WITH
table_storage_info AS (
SELECT
project_number,
project_id,
table_schema,
table_name,
creation_time,
SUM(active_logical_bytes)
/ POWER(1024, 4) AS active_logical_terabytes,
SUM(long_term_logical_bytes)
/ POWER(1024, 4) AS long_term_logical_terabytes,
SUM(active_physical_bytes + fail_safe_physical_bytes)
/ POWER(1024, 4) AS active_physical_terabytes,
SUM(long_term_physical_bytes)
/ POWER(1024, 4) AS long_term_physical_terabytes
FROM
[PROJECT_ID].`region-REGION`.INFORMATION_SCHEMA.TABLE_STORAGE_BY_ORGANIZATION
WHERE
NOT STARTS_WITH(table_schema, '_') -- excluding cache and temp tables
GROUP BY
ALL ),
tables_with_cost AS (
SELECT
project_number,
project_id,
table_schema,
table_name,
creation_time,
active_logical_terabytes,
long_term_logical_terabytes,
active_physical_terabytes,
long_term_physical_terabytes,
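-- storage cost in USD per month, assuming prices per GiB and month of
-- 0.02 / 0.01 (active / long-term logical) and 0.044 / 0.022 (active / long-term physical);
-- adjust these rates to your region's storage pricing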
active_logical_terabytes * 1024 * 0.02 AS active_logical_dollars,
long_term_logical_terabytes * 1024 * 0.01 AS long_term_logical_dollars,
active_physical_terabytes * 1024 * 0.044 AS active_physical_dollars,
long_term_physical_terabytes * 1024 * 0.022 AS long_term_physical_dollars
FROM
table_storage_info )
SELECT * FROM tables_with_cost
Given this index of stored tables, we can now proceed to identify unused data and, with that, potentially Bad Data.
Identify Unused Data #
We can now left join the list of existing tables with the usage statistics. This eliminates all statistics for tables that no longer exist.
We start by extracting the table details from all jobs. For this we use the INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION table and look at both referenced tables and destination tables, but exclude cached results and temporary tables.
Please note that INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION currently contains the job history of the last 180 days. If you want a shorter threshold, introduce a WHERE clause (a sketch follows the next query). For a longer threshold, you would first need to start maintaining a table containing a longer history.
WITH destination_tables AS ( -- tables written to by jobs
SELECT
destination_table.project_id AS project_id,
destination_table.dataset_id AS dataset_id,
destination_table.table_id AS table_id,
creation_time,
user_email
FROM
[PROJECT_ID].`region-REGION`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION AS bigquery_jobs ),
referenced_tables AS ( -- tables read by jobs
SELECT
bigquery_jobs__referenced_tables.project_id AS project_id,
bigquery_jobs__referenced_tables.dataset_id AS dataset_id,
bigquery_jobs__referenced_tables.table_id AS table_id,
creation_time,
user_email
FROM
[PROJECT_ID].`region-REGION`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION AS bigquery_jobs
LEFT JOIN
UNNEST( bigquery_jobs.referenced_tables ) AS bigquery_jobs__referenced_tables ),
table_jobs AS (
SELECT
project_id,
dataset_id,
table_id,
creation_time,
user_email
FROM
referenced_tables
UNION ALL
SELECT
project_id,
dataset_id,
table_id,
creation_time,
user_email
FROM
destination_tables ),
used_tables AS ( -- aggregate usage per table
SELECT
project_id,
dataset_id,
table_id,
MAX(creation_time) AS latest_job_creation_time,
COUNT(DISTINCT user_email) AS distinct_user_count,
COUNT(1) AS job_count
FROM
table_jobs
WHERE
NOT STARTS_WITH(dataset_id, '_')
GROUP BY
ALL
)
SELECT * FROM used_tables
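If you want a shorter lookback window than the 180 days of available job history, a filter along these lines (a sketch; the 90-day interval is only an illustration) could be added to the two reads from INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION above:
-- only consider jobs from the last 90 days
WHERE
creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)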
Now we can bring it all together and join the two result sets. Note that the final query references the tables_with_cost and used_tables CTEs, so either include the CTE definitions from the two queries above in the same statement or materialize their results first:
WITH tables_with_cost_and_usage AS (
SELECT
tables_with_cost.*,
used_tables.latest_job_creation_time,
used_tables.distinct_user_count,
used_tables.job_count
FROM
tables_with_cost
LEFT JOIN
used_tables
ON
used_tables.project_id = tables_with_cost.project_id
AND used_tables.dataset_id = tables_with_cost.table_schema
AND used_tables.table_id = tables_with_cost.table_name
)
SELECT
project_number,
project_id,
table_schema,
table_name,
creation_time,
latest_job_creation_time,
distinct_user_count,
job_count,
SUM(active_logical_terabytes) AS active_logical_terabytes,
SUM(long_term_logical_terabytes) AS long_term_logical_terabytes,
SUM(active_physical_terabytes) AS active_physical_terabytes,
SUM(long_term_physical_terabytes) AS long_term_physical_terabytes,
SUM(active_logical_dollars) AS active_logical_dollars,
SUM(long_term_logical_dollars) AS long_term_logical_dollars,
SUM(active_physical_dollars) AS active_physical_dollars,
SUM(long_term_physical_dollars) AS long_term_physical_dollars,
SUM(active_logical_terabytes + long_term_logical_terabytes) AS total_logical_terabytes,
SUM(active_physical_terabytes + long_term_physical_terabytes) AS total_physical_terabytes,
SUM(active_logical_dollars + long_term_logical_dollars) AS total_logical_dollars,
SUM(active_physical_dollars + long_term_physical_dollars) AS total_physical_dollars
FROM
tables_with_cost_and_usage
GROUP BY
ALL
ORDER BY
total_physical_dollars DESC
I recommend storing the result of this final query for all further analysis; one possible way is sketched below.
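A minimal sketch, assuming the final query is saved as a view called storage_monitoring.table_usage and dated snapshots are appended to storage_monitoring.table_usage_history (both names are placeholders):
-- create the history table once, with the schema of the view plus a snapshot date
CREATE TABLE IF NOT EXISTS storage_monitoring.table_usage_history AS
SELECT CURRENT_DATE() AS snapshot_date, * FROM storage_monitoring.table_usage WHERE FALSE;
-- append a dated snapshot, e.g. via a scheduled query
INSERT INTO storage_monitoring.table_usage_history
SELECT CURRENT_DATE() AS snapshot_date, * FROM storage_monitoring.table_usage;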
Next steps #
Based on the above table you can easily:
- Identify data unused for at least 6 months by filtering for job_count IS NULL, and find the potential savings if this data was deleted (see the sketch after this list).
- Identify unpopular data where job_count < small_number and the latest_job_creation_time is a few weeks in the past.
- Identify whether you can quickly save cost by changing the billing model of a dataset from logical to physical (see the official documentation). You should change a dataset to physical billing if total_physical_dollars < total_logical_dollars.
- Keep track of your actions, and keep a history of your tables, to identify the impact of the changes made.